docs: add MIT LICENSE and prep for public release #1
Merged
- Add MIT LICENSE (required before flipping repo to public)
- README: link to LICENSE; clarify CUDA DLL requirements (NVRTC resolver also supports CUDA 12.x / 11.2+, cublas/cudart remain CUDA 11)
- CLAUDE.md, design doc: fix upscaler example file extension (.pth -> .safetensors; RRDBNet.Load only supports safetensors)
Adds a Helper Scripts section to README describing download-model.ps1, setup-openblas.ps1, setup-llamacpp.ps1, generate-reference-logits.ps1, and the two Python helpers, with a typical first-time-setup snippet.
This was referenced May 31, 2026
This was referenced Jun 14, 2026
pekkah added a commit that referenced this pull request Jun 17, 2026
Work item #1 of #209: amortize the dominant CPU mmap dense-FFN weight read across four MTP draft tokens. The 27B-MTP CUDA-hybrid decode cost center is the 46/64 CPU-resident FFN layers (~6 GB/token); the prior MatVec2In only pair-amortized that read, so the verify optimum sat at k=2 (10.1 t/s) and deeper chains lost to the linear-in-k re-stream.

- SimdKernels.MatVec4In + register-tiled DotQ5K_4In (AVX-512) / DotQ6K_Q8K_4In (AVX2); Q4_K reuses the existing #114 DotQ4K_4In, F32 via 4x DotF32. Each decodes one weight block once and FMAs four input columns, bit-identical per slot to a single MatVec (per-position bits are k-parity-independent). Q8_0 deliberately routes to the dequant fallback to stay bit-identical to single-token dense-FFN decode (MatVec/MatVecDual), which never specialized it.
- CpuDenseFfn4 / DenseFfn4 + a 4-wide lm_head replace the 2-wide loops in both GDN passes' BatchVerify. The three hand-copied duplicated-input-tail idioms collapse into one shared MtpBatchTail.Group4 helper (clamps past-the-end lanes to the last real token, routes their output to a sink).
- Default verify batch moves k=2 -> k=4 (ResolveMtpBatchMax 2->4, ResolveDraftN 1->3) now that the 4-input kernel makes k=4 the measured optimum. The GDN ring alloc stops on OOM and clamps MaxBatchVerifyTokens, so tight-VRAM cards degrade gracefully.
- Amputate the dead public IForwardPass.LastHiddenT1 accessor (no consumers; MtpDecoder drives BatchVerify/HiddenAt). The CPU pass's backing buffer is removed; the CUDA pass keeps _lastHiddenT1/_gpuLastHiddenT1 as internal SHARPI_CPU_GDN=1 debug-trunk scratch.

Bench (27B Q4_K_M CUDA-hybrid, RTX 4070 Ti, -g -1 --no-thinking): decode MTP-off 6.5, k=2 10.1, k=4 12.3, k=6 10.4, k=8 10.2 t/s. New default 12.3 t/s (84% accept) = 1.9x over MTP-off, +22% over the old k=2 default.
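The "decode one weight block once, FMA four input columns" idea can be illustrated with a toy model. The following is only a sketch in Python, not the SimdKernels code: the `(scale, quants)` block layout and the names `dequant`, `matvec`, and `matvec_4in` are hypothetical, and plain multiply-add stands in for the SIMD FMA. It shows why each slot can stay bit-identical to a single MatVec: every column accumulates the same products in the same order, and only the block decode is shared.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy block format: one scale plus a few integer quants.
# Real Q4_K/Q5_K/Q6_K blocks are more involved; this only models the data flow.
Block = Tuple[float, List[int]]
Row = List[Block]


def dequant(block: Block) -> List[float]:
    """Decode one weight block into floats."""
    scale, quants = block
    return [scale * q for q in quants]


def matvec(rows: Sequence[Row], x: Sequence[float]) -> List[float]:
    """Single-input reference: decodes every block once per call."""
    out = []
    for row in rows:
        acc = 0.0
        offset = 0
        for block in row:
            weights = dequant(block)
            for i, w in enumerate(weights):
                acc += w * x[offset + i]
            offset += len(weights)
        out.append(acc)
    return out


def matvec_4in(rows: Sequence[Row], x0, x1, x2, x3) -> List[List[float]]:
    """Four-input variant: each block is decoded once and applied to all
    four input columns, so the weight read is amortized across four tokens."""
    xs = (x0, x1, x2, x3)
    outs: List[List[float]] = [[], [], [], []]
    for row in rows:
        acc = [0.0, 0.0, 0.0, 0.0]
        offset = 0
        for block in row:
            weights = dequant(block)  # decoded once, shared by all four slots
            for i, w in enumerate(weights):
                for slot in range(4):
                    # Same products, same order as matvec() for this slot.
                    acc[slot] += w * xs[slot][offset + i]
            offset += len(weights)
        for slot in range(4):
            outs[slot].append(acc[slot])
    return outs
```

Because the per-slot accumulation order matches the single-input path, comparing `matvec_4in(rows, a, b, c, d)[s]` against `matvec(rows, (a, b, c, d)[s])` with exact equality is a meaningful oracle in this toy model.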
Tests: MatVec4In_BitwiseMatchesSingleMatVec (Q4_K/Q5_K/Q6_K/Q8_0/F32, serial + Parallel.For) and MtpBatchTail lane-mapping (k=1..9) are bitwise oracles; the CUDA k=4 batched-verify suite stays green. MtpDecoder_GreedyParity_LlamaCpp is untouched and unaffected (both 2In and 4In are per-token bit-identical to single MatVec, so the CPU pass emits identical per-token logits).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
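The duplicated-input-tail rule that the commit attributes to MtpBatchTail.Group4 (past-the-end lanes are clamped to the last real token and their output goes to a sink) can be sketched as a lane mapping. This is an illustrative Python model only: the function name `group4`, its return shape, and the use of `None` for the sink are assumptions, not the helper's real signature.

```python
from typing import List, Optional, Tuple

# (input_index, output_index); output_index None means "write to the sink".
Lane = Tuple[int, Optional[int]]


def group4(k: int) -> List[List[Lane]]:
    """Split k token positions into groups of four lanes.

    Lanes past the end reuse the last real token as input (so the 4-wide
    kernel always has valid data) and discard their result.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    groups: List[List[Lane]] = []
    for base in range(0, k, 4):
        lanes: List[Lane] = []
        for lane in range(4):
            pos = base + lane
            if pos < k:
                lanes.append((pos, pos))      # real token: output kept
            else:
                lanes.append((k - 1, None))   # padding lane: clamp, sink
        groups.append(lanes)
    return groups
```

With this mapping, k=5 yields two groups: the first covers positions 0..3, and the second has one real lane (position 4) plus three padding lanes that read position 4 and write nowhere. A lane-mapping test over k=1..9 would then check that every real position is written exactly once and that no lane reads out of range.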
This was referenced Jun 17, 2026
Open
perf(mtp): batched-verify logits-buffer reuse + PromptLookup incremental index (#209 items 5,6) #291
Open