docs: add MIT LICENSE and prep for public release by pekkah · Pull Request #1 · pekkah/SharpInference

pekkah · 2026-04-15T14:00:03Z

- Add MIT LICENSE (required before flipping repo to public)
- README: link to LICENSE; clarify CUDA DLL requirements (NVRTC resolver also supports CUDA 12.x / 11.2+, cublas/cudart remain CUDA 11)
- CLAUDE.md, design doc: fix upscaler example file extension (.pth -> .safetensors; RRDBNet.Load only supports safetensors)



Work item #1 of #209: amortize the dominant CPU mmap dense-FFN weight read across four MTP draft tokens. The 27B-MTP CUDA-hybrid decode cost center is the 46/64 CPU-resident FFN layers (~6 GB/token); the prior MatVec2In only pair-amortized that read, so the verify optimum sat at k=2 (10.1 t/s) and deeper chains lost to the linear-in-k re-stream.

- SimdKernels.MatVec4In + register-tiled DotQ5K_4In (AVX-512) / DotQ6K_Q8K_4In (AVX2); Q4_K reuses the existing #114 DotQ4K_4In, F32 via 4x DotF32. Each decodes one weight block once and FMAs four input columns, bit-identical per slot to a single MatVec (per-position bits are k-parity-independent). Q8_0 deliberately routes to the dequant fallback to stay bit-identical to single-token dense-FFN decode (MatVec/MatVecDual), which never specialized it.
- CpuDenseFfn4 / DenseFfn4 + a 4-wide lm_head replace the 2-wide loops in both GDN passes' BatchVerify. The three hand-copied duplicated-input-tail idioms collapse into one shared MtpBatchTail.Group4 helper (clamps past-the-end lanes to the last real token, routes their output to a sink).
- Default verify batch moves k=2 -> k=4 (ResolveMtpBatchMax 2->4, ResolveDraftN 1->3) now that the 4-input kernel makes k=4 the measured optimum. The GDN ring alloc stops on OOM and clamps MaxBatchVerifyTokens, so tight-VRAM cards degrade gracefully.
- Amputate the dead public IForwardPass.LastHiddenT1 accessor (no consumers; MtpDecoder drives BatchVerify/HiddenAt). The CPU pass's backing buffer is removed; the CUDA pass keeps _lastHiddenT1/_gpuLastHiddenT1 as internal SHARPI_CPU_GDN=1 debug-trunk scratch.

Bench (27B Q4_K_M CUDA-hybrid, RTX 4070 Ti, -g -1 --no-thinking): decode MTP-off 6.5, k=2 10.1, k=4 12.3, k=6 10.4, k=8 10.2 t/s. New default 12.3 t/s (84% accept) = 1.9x over MTP-off, +22% over the old k=2 default.
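The amortization idea behind the 4-input kernel can be illustrated with a toy model. The sketch below is Python, not the project's C# SIMD code, and everything in it (`ToyQuantMatrix`, `matvec`, `matvec4`, the block width of 4, the scale-times-integer "quantization") is a hypothetical stand-in. It only demonstrates the two properties the commit message claims: each weight block is decoded once for four inputs, and each lane accumulates in the same order as a single-input pass, so per-lane results match bit for bit.

```python
import random

BLOCK = 4  # toy block width (assumption; real quantized blocks are larger)


class ToyQuantMatrix:
    """Hypothetical block-quantized matrix: each block is (scale, int codes)."""

    def __init__(self, n_rows, n_cols, seed=0):
        assert n_cols % BLOCK == 0
        rng = random.Random(seed)
        self.n_rows = n_rows
        self.n_blocks = n_cols // BLOCK
        self.blocks = [
            [(rng.uniform(0.01, 0.1), [rng.randint(-8, 7) for _ in range(BLOCK)])
             for _ in range(self.n_blocks)]
            for _ in range(n_rows)
        ]
        self.decodes = 0  # counts block decodes, a proxy for weight-read traffic

    def decode(self, row, blk):
        self.decodes += 1
        scale, codes = self.blocks[row][blk]
        return [scale * c for c in codes]


def matvec(w, x):
    """Single-input product: every block is decoded once per call."""
    out = []
    for r in range(w.n_rows):
        acc = 0.0
        for b in range(w.n_blocks):
            vals = w.decode(r, b)
            for j, v in enumerate(vals):
                acc += v * x[b * BLOCK + j]
        out.append(acc)
    return out


def matvec4(w, xs):
    """Four-input product: each block is decoded once and reused by all lanes."""
    assert len(xs) == 4
    outs = [[] for _ in range(4)]
    for r in range(w.n_rows):
        acc = [0.0, 0.0, 0.0, 0.0]
        for b in range(w.n_blocks):
            vals = w.decode(r, b)  # one decode shared by four input columns
            for lane in range(4):
                x = xs[lane]
                # Same element order as matvec, so each lane is bit-identical.
                for j, v in enumerate(vals):
                    acc[lane] += v * x[b * BLOCK + j]
        for lane in range(4):
            outs[lane].append(acc[lane])
    return outs
```

In this toy model, four separate `matvec` calls decode every block four times, while one `matvec4` call decodes each block once; that is the read the commit describes as amortized across four draft tokens.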
Tests: MatVec4In_BitwiseMatchesSingleMatVec (Q4_K/Q5_K/Q6_K/Q8_0/F32, serial + Parallel.For) and MtpBatchTail lane-mapping (k=1..9) are bitwise oracles; the CUDA k=4 batched-verify suite stays green. MtpDecoder_GreedyParity_LlamaCpp is untouched and unaffected (both 2In and 4In are per-token bit-identical to single MatVec, so the CPU pass emits identical per-token logits).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
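The tail handling described for MtpBatchTail.Group4 (clamp past-the-end lanes to the last real token, route their output to a sink) can be sketched as a lane-mapping function. This is a guess at the shape of the idea in Python, not the actual helper: the name `group4`, the `SINK` marker, and the `(input_index, output_index)` tuple layout are all assumptions made for illustration.

```python
SINK = -1  # hypothetical marker meaning "discard this lane's output"


def group4(k):
    """Split k token positions into groups of four lanes.

    Each lane is an (input_index, output_index) pair. Lanes past the end
    re-read the last real token (index k - 1) and write to SINK, so a
    4-wide kernel always sees four valid inputs.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    groups = []
    for start in range(0, k, 4):
        lanes = []
        for lane in range(4):
            pos = start + lane
            if pos < k:
                lanes.append((pos, pos))       # real token: output kept
            else:
                lanes.append((k - 1, SINK))    # padding: clamped input, sunk output
        groups.append(lanes)
    return groups
```

For example, `group4(5)` yields one full group for positions 0 to 3 and a second group whose first lane is position 4 and whose remaining three lanes repeat position 4 as input with their output discarded.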

claude added 5 commits April 15, 2026 13:58

docs: document helper scripts in scripts/ directory

9dab550

Adds a Helper Scripts section to README describing download-model.ps1, setup-openblas.ps1, setup-llamacpp.ps1, generate-reference-logits.ps1, and the two Python helpers, with a typical first-time-setup snippet.

docs: add experimental/WIP status disclaimer to README

626f00e

docs: reframe README disclaimer as a spike

47e4268

docs: note that the ASP.NET server host is untested end-to-end

3554cc4

pekkah merged commit e8620ca into master Apr 15, 2026
1 check passed

pekkah deleted the claude/prepare-public-release-PXoaT branch April 15, 2026 14:06

This was referenced May 31, 2026

Evaluate NVIDIA Model-Optimizer techniques for low-VRAM (12GB/64GB) perf #104 (Open)

Tighten DotQ3K_Q8K / DotQ8_0_Q8K parity so auto-on can be re-tried (#103) #107 (Closed)

pekkah merged 5 commits into master from claude/prepare-public-release-PXoaT

2 participants