docs: add MIT LICENSE and prep for public release #1

Merged
pekkah merged 5 commits into master from claude/prepare-public-release-PXoaT on Apr 15, 2026

@pekkah pekkah commented Apr 15, 2026

Owner
  • Add MIT LICENSE (required before flipping repo to public)
  • README: link to LICENSE; clarify CUDA DLL requirements (NVRTC resolver
    also supports CUDA 12.x / 11.2+, cublas/cudart remain CUDA 11)
  • CLAUDE.md, design doc: fix upscaler example file extension
    (.pth -> .safetensors; RRDBNet.Load only supports safetensors)

claude added 5 commits April 15, 2026 13:58
- Add MIT LICENSE (required before flipping repo to public)
- README: link to LICENSE; clarify CUDA DLL requirements (NVRTC resolver
  also supports CUDA 12.x / 11.2+, cublas/cudart remain CUDA 11)
- CLAUDE.md, design doc: fix upscaler example file extension
  (.pth -> .safetensors; RRDBNet.Load only supports safetensors)

Adds a Helper Scripts section to README describing download-model.ps1,
setup-openblas.ps1, setup-llamacpp.ps1, generate-reference-logits.ps1,
and the two Python helpers, with a typical first-time-setup snippet.
@pekkah pekkah merged commit e8620ca into master Apr 15, 2026
1 check passed
@pekkah pekkah deleted the claude/prepare-public-release-PXoaT branch April 15, 2026 14:06
pekkah added a commit that referenced this pull request Jun 17, 2026
Work item #1 of #209: amortize the dominant CPU mmap dense-FFN weight read
across four MTP draft tokens. The 27B-MTP CUDA-hybrid decode cost center is the
46/64 CPU-resident FFN layers (~6 GB/token); the prior MatVec2In only
pair-amortized that read, so the verify optimum sat at k=2 (10.1 t/s) and deeper
chains lost to the linear-in-k re-stream.

- SimdKernels.MatVec4In + register-tiled DotQ5K_4In (AVX-512) / DotQ6K_Q8K_4In
  (AVX2); Q4_K reuses the existing #114 DotQ4K_4In, F32 via 4x DotF32. Each
  decodes one weight block once and FMAs four input columns, bit-identical per
  slot to a single MatVec (per-position bits are k-parity-independent).
  Q8_0 deliberately routes to the dequant fallback to stay bit-identical to
  single-token dense-FFN decode (MatVec/MatVecDual), which never specialized it.
- CpuDenseFfn4 / DenseFfn4 + a 4-wide lm_head replace the 2-wide loops in both
  GDN passes' BatchVerify. The three hand-copied duplicated-input-tail idioms
  collapse into one shared MtpBatchTail.Group4 helper (clamps past-the-end lanes
  to the last real token, routes their output to a sink).
- Default verify batch moves k=2 -> k=4 (ResolveMtpBatchMax 2->4,
  ResolveDraftN 1->3) now that the 4-input kernel makes k=4 the measured optimum.
  The GDN ring alloc stops on OOM and clamps MaxBatchVerifyTokens, so tight-VRAM
  cards degrade gracefully.
- Amputate the dead public IForwardPass.LastHiddenT1 accessor (no consumers;
  MtpDecoder drives BatchVerify/HiddenAt). The CPU pass's backing buffer is
  removed; the CUDA pass keeps _lastHiddenT1/_gpuLastHiddenT1 as internal
  SHARPI_CPU_GDN=1 debug-trunk scratch.
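
The amortization idea behind the 4-input kernel can be illustrated with a toy model. This is a minimal Python sketch, not the repository's SIMD code: `Block`, `Decoder`, `matvec`, and `matvec_4in` are hypothetical names, and the "scale times integer" block format is a stand-in for the real Q4_K/Q5_K/Q6_K layouts. It shows only the two properties the commit message claims: each weight block is decoded once for four inputs, and each output slot accumulates in the same order as a single-input pass, so the results match bit for bit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Toy quantized block: one scale plus small integer quants (assumed format)."""
    scale: float
    quants: List[int]


class Decoder:
    """Dequantizes blocks and counts how often it is asked to."""

    def __init__(self) -> None:
        self.calls = 0

    def decode(self, block: Block) -> List[float]:
        self.calls += 1
        return [block.scale * q for q in block.quants]


def matvec(rows: List[List[Block]], x: List[float], dec: Decoder) -> List[float]:
    """Single-input reference: decode every block, accumulate left to right."""
    out = []
    for row in rows:
        acc, col = 0.0, 0
        for block in row:
            weights = dec.decode(block)
            for j, w in enumerate(weights):
                acc += w * x[col + j]
            col += len(weights)
        out.append(acc)
    return out


def matvec_4in(rows: List[List[Block]], xs: List[List[float]], dec: Decoder) -> List[List[float]]:
    """Four-input variant: decode each block once, update four accumulators.

    Each slot sees the same multiply-add sequence as matvec(), so its result
    is bitwise identical to running matvec() on that input alone.
    """
    assert len(xs) == 4
    outs: List[List[float]] = [[] for _ in range(4)]
    for row in rows:
        acc, col = [0.0, 0.0, 0.0, 0.0], 0
        for block in row:
            weights = dec.decode(block)  # one decode shared by all four inputs
            for j, w in enumerate(weights):
                for slot in range(4):
                    acc[slot] += w * xs[slot][col + j]
            col += len(weights)
        for slot in range(4):
            outs[slot].append(acc[slot])
    return outs


# Small deterministic example: 2 rows, 2 blocks per row, 8 columns.
ROWS = [
    [Block(0.5, [1, -2, 3, 4]), Block(0.25, [-1, 0, 2, 5])],
    [Block(0.125, [7, 1, -3, 2]), Block(2.0, [0, -4, 1, 1])],
]
XS = [[0.1 * (s + 1) + 0.01 * i for i in range(8)] for s in range(4)]
```

Running `matvec` four times over `ROWS` decodes 16 blocks, while one `matvec_4in` call decodes 4; comparing the two outputs with `==` checks the per-slot equality in this toy setting.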

Bench (27B Q4_K_M CUDA-hybrid, RTX 4070 Ti, -g -1 --no-thinking): decode
MTP-off 6.5, k=2 10.1, k=4 12.3, k=6 10.4, k=8 10.2 t/s. New default 12.3 t/s
(84% accept) = 1.9x over MTP-off, +22% over the old k=2 default.
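
The headline ratios follow directly from the reported throughput figures. A short Python check, using only the numbers quoted above (the dictionary layout and variable names are illustrative):

```python
# Decode throughput in tokens/s as reported in the bench line above.
bench = {"off": 6.5, 2: 10.1, 4: 12.3, 6: 10.4, 8: 10.2}

speedup_vs_off = bench[4] / bench["off"]    # k=4 relative to MTP-off
gain_vs_k2 = bench[4] / bench[2] - 1.0      # k=4 relative to the old k=2 default

# Batch size with the highest measured throughput (MTP-off excluded).
best_k = max((k for k in bench if k != "off"), key=lambda k: bench[k])

print(f"{speedup_vs_off:.1f}x over MTP-off, +{gain_vs_k2:.0%} over k=2, best k={best_k}")
```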

Tests: MatVec4In_BitwiseMatchesSingleMatVec (Q4_K/Q5_K/Q6_K/Q8_0/F32, serial +
Parallel.For) and MtpBatchTail lane-mapping (k=1..9) are bitwise oracles; the
CUDA k=4 batched-verify suite stays green. MtpDecoder_GreedyParity_LlamaCpp is
untouched and unaffected (both 2In and 4In are per-token bit-identical to single
MatVec, so the CPU pass emits identical per-token logits).
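
The lane mapping those tests exercise can be sketched as follows. This is a hypothetical Python rendering of the behavior described for MtpBatchTail.Group4 (past-the-end lanes are clamped to the last real token and their output goes to a sink); the function name `group4`, the `SINK` marker, and the tuple layout are assumptions, since the actual signature is not shown here.

```python
from typing import List, Optional, Tuple

SINK = None  # marker for "discard this lane's output" (assumed representation)

Lane = Tuple[int, Optional[int]]  # (source token index, destination index or SINK)


def group4(k: int) -> List[List[Lane]]:
    """Split k tokens into groups of four lanes.

    Lanes past the end read the last real token (index k - 1) so the kernel
    always has four valid inputs, and write to SINK so nothing real is
    overwritten.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    groups = []
    for base in range(0, k, 4):
        lanes: List[Lane] = []
        for lane in range(4):
            i = base + lane
            if i < k:
                lanes.append((i, i))         # real token: read i, write i
            else:
                lanes.append((k - 1, SINK))  # padding lane: clamp and discard
        groups.append(lanes)
    return groups
```

For k=5 this yields one full group for tokens 0 to 3 and a second group whose first lane handles token 4 while the remaining three lanes re-read token 4 and write to the sink.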

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

2 participants