Follow-up to #209 (work items #5 + #6 — the low-leverage allocation/drafting micro-opts). #209's #1 shipped in #287.
Item #5 — batched-verify logits buffer churn
Every BatchVerify call returns k fresh vocab-sized float[] arrays per step (~600 KB each, so every one lands on the LOH). EnsureBatchVerifyScratch also sizes its buffers exactly, so with --draft-lookup's varying proposal lengths the device and host buffers are reallocated on most steps. Fix: reuse per-k cached buffers, or return views rather than fresh arrays.
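A minimal sketch of the host-side half of that idea, assuming a single grow-only flat buffer that is handed out as row views. `LogitsScratch`, `Rent`, `Row`, and `CapacityRows` are hypothetical names for illustration; the real `BatchVerify` / `EnsureBatchVerifyScratch` signatures are not shown here, and the device buffer would need the same grow-only treatment.

```csharp
using System;

// Hypothetical helper (not part of the codebase): one flat host buffer that
// only ever grows, exposed as k row views instead of k fresh float[] arrays.
public sealed class LogitsScratch
{
    private readonly int _vocabSize;
    private float[] _buffer = Array.Empty<float>();

    public LogitsScratch(int vocabSize)
    {
        if (vocabSize <= 0) throw new ArgumentOutOfRangeException(nameof(vocabSize));
        _vocabSize = vocabSize;
    }

    // Number of logits rows the current allocation can hold.
    public int CapacityRows => _buffer.Length / _vocabSize;

    // Grow-only: allocates only when k exceeds the largest k seen so far,
    // so varying proposal lengths stop triggering a realloc on most steps.
    public Memory<float> Rent(int k)
    {
        if (k <= 0) throw new ArgumentOutOfRangeException(nameof(k));
        int needed = checked(k * _vocabSize);
        if (_buffer.Length < needed)
            _buffer = new float[needed];
        return _buffer.AsMemory(0, needed);
    }

    // View of row i in the current buffer. The view is only valid until the
    // next Rent call that grows the buffer.
    public Span<float> Row(int i) => _buffer.AsSpan(i * _vocabSize, _vocabSize);
}
```

The trade-off is lifetime: a caller that keeps logits across steps would have to copy them out, since the next step overwrites the same memory.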
Item #6 — PromptLookupDraft incremental n-gram index
PromptLookupDraft.Propose linearly scans the whole history on every step (≥0.1–0.5 ms at 16–32K ctx on the no-match path, which is exactly the "floor = baseline" case). Replace the scan with a llama.cpp-style last-occurrence map that Append updates in O(1).
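One way this could look, as a sketch rather than the actual PromptLookupDraft design: a fixed n-gram length, a dictionary from n-gram hash to the index of the token that followed its most recent occurrence, and a collision check on lookup. `NGramLookupIndex`, its constructor parameter, and the FNV-1a-style hash are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch (not the existing PromptLookupDraft API).
public sealed class NGramLookupIndex
{
    private readonly int _n;
    private readonly List<int> _history = new List<int>();
    // n-gram hash -> index of the token that followed its latest occurrence.
    private readonly Dictionary<ulong, int> _lastContinuation = new Dictionary<ulong, int>();

    public NGramLookupIndex(int n)
    {
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));
        _n = n;
    }

    // O(1) for fixed n: the n-gram ending just before the new token now has
    // a continuation, so record where that continuation starts.
    public void Append(int token)
    {
        int pos = _history.Count; // index the new token will occupy
        if (pos >= _n)
            _lastContinuation[HashAt(pos - _n)] = pos;
        _history.Add(token);
    }

    // One dictionary probe instead of a scan over the whole history.
    public int[] Propose(int maxDraft)
    {
        int count = _history.Count;
        if (maxDraft <= 0 || count < _n) return Array.Empty<int>();
        int tail = count - _n;
        if (!_lastContinuation.TryGetValue(HashAt(tail), out int start))
            return Array.Empty<int>(); // no-match path
        // Guard against hash collisions by comparing the actual tokens.
        for (int i = 0; i < _n; i++)
            if (_history[start - _n + i] != _history[tail + i])
                return Array.Empty<int>();
        int len = Math.Min(maxDraft, count - start);
        return _history.GetRange(start, len).ToArray();
    }

    // FNV-1a-style hash over the n tokens starting at 'start'.
    private ulong HashAt(int start)
    {
        unchecked
        {
            ulong h = 14695981039346656037UL;
            for (int i = 0; i < _n; i++)
            {
                h ^= (uint)_history[start + i];
                h *= 1099511628211UL;
            }
            return h;
        }
    }
}
```

Registering each n-gram only once a token follows it keeps the trailing n-gram from matching itself. A hash collision can at worst cost a missed draft, so the change stays correctness-neutral.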
Both are correctness-neutral micro-opts, grouped into one issue because each is small. Note that PromptLookupDraft currently has no production engine caller (it's the --draft-lookup path only), so #6 is the lowest priority.