Tracking the two latent traps the #164 review surfaced (commit 48a5622), plus one low-value test gap (item 3). Both traps are correct/inert TODAY; this is so they aren't forgotten if those paths ever activate.
1. bf16 append ring wrap is untested (identity-only today). llm_kv_append_bf16 / llm_kv_append_batched_bf16 got the pos % max_seq_len modulo for symmetry with the f32 kernels, but the only consumer of the bf16 KV cache is the full-context GDN-hybrid path, where pos < max_seq_len makes the modulo the identity. If a future windowed model ever uses the bf16 KV cache, the wrap path would be exercised for the first time with no test. Add a synthetic bf16 ring-wrap bit-wise test (mirror the f32 CudaGdnBatchedTrunkTests append tests) when that happens.
2. CudaHybridForwardPass SWA ring is bare-window (decode-only). It deliberately sizes SWA caches at min(ctx, window) (not SwaRingSize = window+headroom), which is correct because Gemma 4 is decode-only in that path (IsBatchedPrefillSupported requires !_isGemma4Like) and per-token decode only needs ring ≥ window. If batched/chunked prefill is ever enabled for Gemma 4 in the hybrid path, the bare-window ring would overwrite part of a still-needed window; switch it to SwaRingSize at that point (there's a code comment flagging this).
3. Pure-decode multi-wrap is untested — generating >4608 tokens to wrap the decode ring is impractically slow via the per-token oracle; the prefill path now covers wrapped reads observably (#164), so this is low value.
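As a rough illustration of item 1, here is a host-side Python model of what a synthetic bf16 ring-wrap bit-wise check could look like. It is only a sketch, not the CUDA kernel or the real test: `f32_to_bf16_bits`, `ring_append`, and `run_ring_wrap_check` are hypothetical names, and round-to-nearest-even is an assumed rounding mode that may not match what llm_kv_append_bf16 actually does.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit bf16 pattern for a finite float (round-to-nearest-even).

    Assumption: the real kernel's rounding mode may differ. NaN and
    overflow handling are deliberately left out of this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # 0x7FFF plus the lowest kept bit implements ties-to-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def ring_append(cache, pos, row, max_seq_len):
    """Model of the append: store `row` as bf16 bits at slot pos % max_seq_len."""
    cache[pos % max_seq_len] = [f32_to_bf16_bits(v) for v in row]


def run_ring_wrap_check(max_seq_len=4, head_dim=3, n_tokens=10):
    """Append n_tokens rows, then verify every slot bit-for-bit."""
    cache = [[0] * head_dim for _ in range(max_seq_len)]
    rows = [[pos + 0.25 * d for d in range(head_dim)] for pos in range(n_tokens)]
    for pos, row in enumerate(rows):
        ring_append(cache, pos, row, max_seq_len)

    for slot in range(max_seq_len):
        writers = [p for p in range(n_tokens) if p % max_seq_len == slot]
        if not writers:
            # Never written: the slot must still hold its initial zeros.
            assert cache[slot] == [0] * head_dim
            continue
        # Otherwise the slot must hold the most recent position mapped to it.
        expected = [f32_to_bf16_bits(v) for v in rows[writers[-1]]]
        assert cache[slot] == expected
    return cache


if __name__ == "__main__":
    run_ring_wrap_check(max_seq_len=8, n_tokens=5)   # identity case: pos < max_seq_len
    run_ring_wrap_check(max_seq_len=4, n_tokens=10)  # wrap case: slots are overwritten
    print("ring-wrap model checks passed")
```

The first call mirrors today's situation (the modulo is the identity); the second is the case that currently has no coverage on the bf16 path.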
No action needed now; reference if touching SWA / bf16-KV / hybrid-Gemma paths.
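For anyone who does end up touching the SWA path, the overwrite hazard in item 2 can be modeled on the host with a few lines of Python. This is a toy under stated assumptions, not the real scheduling: it assumes all K/V rows of a prefill chunk are appended to the ring before any query in that chunk reads its window, and it uses a headroom of chunk_len - 1 purely for illustration (the real SwaRingSize headroom is whatever the codebase defines). `clobbered_positions` is a hypothetical helper.

```python
def clobbered_positions(ring_size, window, chunk_start, chunk_len):
    """List positions the chunk's queries still need but whose slots were overwritten.

    Assumption: every row of the chunk is written to the ring before any
    query in the chunk attends over its sliding window.
    """
    chunk_end = chunk_start + chunk_len  # exclusive

    # Replay all appends up to the end of the chunk and record which
    # position ends up occupying each ring slot.
    owner = {}
    for pos in range(chunk_end):
        owner[pos % ring_size] = pos

    lost = set()
    for q in range(chunk_start, chunk_end):
        # Query q attends to the last `window` positions, itself included.
        for k in range(max(0, q - window + 1), q + 1):
            if owner[k % ring_size] != k:
                lost.add(k)
    return sorted(lost)


if __name__ == "__main__":
    window = 4
    # Per-token decode with a bare-window ring: nothing needed is lost.
    print(clobbered_positions(ring_size=window, window=window, chunk_start=8, chunk_len=1))
    # A 3-token prefill chunk with the same ring: earlier keys are gone.
    print(clobbered_positions(ring_size=window, window=window, chunk_start=8, chunk_len=3))
    # Same chunk with window + (chunk_len - 1) slots: safe again.
    print(clobbered_positions(ring_size=window + 2, window=window, chunk_start=8, chunk_len=3))
```

In this model, per-token decode is safe with ring == window, while any chunk of two or more tokens clobbers keys that the first query in the chunk still needs, which is why the ring would have to grow once batched prefill is enabled.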