SWA KV ring: bf16-append wrap + CudaHybridForwardPass batched path are latent-only / untested #166

Description

@pekkah

Tracking the two latent traps the #164 review surfaced (commit 48a5622), plus one known test gap (item 3). Both traps are correct/inert TODAY; this is so they aren't forgotten if those paths ever activate.

1. bf16 append ring wrap is untested (identity-only today). llm_kv_append_bf16 / llm_kv_append_batched_bf16 got the pos % max_seq_len modulo for symmetry with the f32 kernels, but the only consumer of the bf16 KV cache is the full-context GDN-hybrid path, where pos < max_seq_len makes the modulo the identity. If a future windowed model ever uses the bf16 KV cache, the wrap path would be exercised for the first time with no test. Add a synthetic bf16 ring-wrap bit-wise test (mirror the f32 CudaGdnBatchedTrunkTests append tests) when that happens.

2. CudaHybridForwardPass SWA ring is bare-window (decode-only). It deliberately sizes SWA caches at min(ctx, window) (not SwaRingSize = window+headroom), which is correct because Gemma 4 there is decode-only (IsBatchedPrefillSupported requires !_isGemma4Like) and per-token decode only needs ring ≥ window. If batched/chunked prefill is ever enabled for Gemma 4 in the hybrid path, the bare-window ring would overwrite a still-needed window — switch it to SwaRingSize at that point (there's a code comment flagging this).

3. Pure-decode multi-wrap is untested. Generating >4608 tokens to wrap the decode ring is impractically slow via the per-token oracle, and the prefill path now covers wrapped reads observably (#164), so a dedicated test is low value.
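
For reference, a minimal Python sketch of the two traps in items 1 and 2. This is a toy model, not the actual kernels or `CudaHybridForwardPass` code: `ring_slot`, `append`, and `chunked_prefill_reads_ok` are hypothetical names, the ring holds plain integers instead of bf16 K/V rows, and it assumes a batched prefill appends a whole chunk before any query in that chunk reads its window. The headroom of `chunk - 1` is likewise an assumption for the sketch; the real `SwaRingSize` headroom is not stated here.

```python
def ring_slot(pos: int, ring_len: int) -> int:
    """Slot for absolute position `pos` (the `pos % max_seq_len` rule)."""
    return pos % ring_len


def append(ring: list, pos: int, value: int) -> None:
    """Write one entry into the ring at its wrapped slot."""
    ring[ring_slot(pos, len(ring))] = value


def chunked_prefill_reads_ok(ring_len: int, window: int, chunk: int, total: int) -> bool:
    """Return False if any query reads a slot that was already overwritten.

    Assumed model: each chunk is appended in full, then every query position
    in the chunk reads its last `window` positions. chunk == 1 is plain decode.
    """
    ring = [None] * ring_len
    for start in range(0, total, chunk):
        end = min(start + chunk, total)
        for pos in range(start, end):
            append(ring, pos, pos)  # store the position itself as the payload
        for q in range(start, end):
            for p in range(max(0, q - window + 1), q + 1):
                if ring[ring_slot(p, ring_len)] != p:
                    return False  # a still-needed entry was clobbered
    return True


# Item 1: with pos < max_seq_len the modulo is the identity, so the wrap
# branch never runs; it only matters once pos >= max_seq_len.
max_seq_len = 8
identity_only = all(ring_slot(p, max_seq_len) == p for p in range(max_seq_len))
wrapped_slot = ring_slot(max_seq_len + 1, max_seq_len)

# Item 2: a bare-window ring is enough for per-token decode, but not for
# chunked prefill; adding headroom (here chunk - 1) fixes it.
window, chunk = 4, 3
decode_ok = chunked_prefill_reads_ok(window, window, 1, 20)
bare_window_prefill_ok = chunked_prefill_reads_ok(window, window, chunk, 20)
headroom_prefill_ok = chunked_prefill_reads_ok(window + chunk - 1, window, chunk, 20)
```

Under these assumptions, `decode_ok` and `headroom_prefill_ok` hold while `bare_window_prefill_ok` does not, which is the failure mode item 2 warns about. A real bf16 ring-wrap test would compare bytes written by the kernel, not integers.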

No action needed now; reference if touching SWA / bf16-KV / hybrid-Gemma paths.

Labels: maintenance (Recurring upkeep / sync with upstream)