pekkah · pekkah · Jun 7, 2026 · Jun 7, 2026 · Jun 7, 2026 · Jun 7, 2026
diff --git a/README.md b/README.md
@@ -51,7 +51,7 @@ coherent (`scripts/bench-all.ps1`); top-1 parity vs llama.cpp b8585 verified on
 | Qwen3.6-35B-A3B-MTP (GDN+MoE) | (same) | 22 GB | **CUDA** `-g -1 --no-thinking` (hybrid) | **65.0** | **22.9** | Requires `SHARPI_CPU_MOE=1`: 30 GDN + 10 attn + shared expert on GPU, routed experts CPU mmap. 100% acceptance. Fused GDN scan + batched SDPA (#114-B/#118), bit-identical, grows with ctx |
 | Carnice (Qwen3.6-35B-A3B-MTP finetune) | [mudler](https://huggingface.co/mudler/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-GGUF) | 17 GB | **CUDA** `-g -1 --no-thinking` (hybrid) | **43.6** | **25.0** | agentic finetune of 35B-A3B-MTP; 77% acceptance (`bench-carnice.ps1` — the default prompt 1-token-EOSes on this terser tune). APEX mixed-precision (Q3_K + Q8_0 experts); Q8_KS per-32 int dots auto-enable at load (#99/#101/#107), +4.6% decode at ~4× tighter parity vs plain Q8_K (`SHARPI_Q3K_Q8K=0`/`SHARPI_Q8_0_Q8K=0` to disable). Fused GDN scan + wave SDPA (#114-B/#118) bit-identical past 4096 |
 | Gemma 4 E4B-it Q8 | [unsloth](https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF) | 8 GB | CPU | 4.9 | 5.0 | dense 42-layer gemma4: per-layer head_dim (256 SWA / 512 global), dual-RoPE, KV-share tail (18 layers), 5:1 SWA:global, logit softcap 30, PLE-256 injection (~4.2 GB mmap-resident) |
-| Gemma 4 E4B-it Q8 | (same) | 8 GB | **CUDA** `-g -1 -c 2048` | **3698** | **59** | all 42 layers fit at `-c 2048`. KV-share alias + SWA/global split per layer; PLE projections (~215 MB) upload at construction. **Prefill (#141):** int8 **tensor-core MMQ** matmul (`mma.m16n8k32.s8`, each Q8_0 weight read once as int8 — beats the dequant→fp16→cuBLAS GEMM, drops its fp16 HBM temp; `SHARPI_PREFILL_MMQ=0` reverts) + a **tensor-core flash-attention** prefill (#146/#147): both QK^T and P·V on the mma cores (`mma.m16n8k16.f16`), multi-warp **d-split** so the O tile stays register-resident — replaces the scalar O(n²) per-query attention (which re-streamed each query's K/V window up to ~512×) and is **+27% at ~1K / +40% at 1.8K** over the earlier half2 flash kernel (`SHARPI_PREFILL_FLASH_TC=0` reverts to half2, `=…_FLASH=0` to scalar) + a **SoA Q8_0 weight repack** (#149): all Q8_0 readers (MMQ, dp4a, fp32 matvec, GEMM-N, dequant) read the quants 16-byte-aligned with the fp16 scales split out, killing the `qs` 2-byte-misalignment funnelshift tax — **+10-12% prefill, bit-identical**; `SHARPI_MMQ_SOA=0` reverts) + a batched Q8_0 embedding lookup. **109→3698 at ~1K ctx, →4240 at 1.8K** — profiling showed *attention*, then the matmul inner-loop efficiency, were the dominant prefill costs at realistic prompt lengths. **Decode (#142):** dp4a/Q8_1 int8 matvec (`SHARPI_Q80_DP4A=0` to bisect) + CUDA-graph capture/replay default-on (`SHARPI_CUDA_GRAPH=0` to bisect). All prefill/decode fast paths are argmax-stable vs the fp32 path, not bit-exact (the SoA repack is bit-identical). Remaining gap to llama.cpp (~8475 prefill / ~78 decode): cp.async-pipelined MMQ on the SoA layout + decode matvec work |
+| Gemma 4 E4B-it Q8 | (same) | 8 GB | **CUDA** `-g -1 -c 2048` | **3698** | **59** | all 42 layers fit at `-c 2048`; KV-share alias + per-layer SWA/global split; PLE projections (~215 MB) upload at construction. **Prefill (#141/#146/#147/#149):** int8 tensor-core MMQ (`mma.m16n8k32.s8`, weight read once as int8) + tensor-core flash attention (QK^T and P·V on mma cores, register-resident O via d-split) + a SoA Q8_0 weight repack (16-byte-aligned quants) + batched embedding lookup — **109 → 3698 t/s @1K, 4240 @1.8K**. **>4096 prompts (#162/#164):** a real SWA KV ring (cache sized window + one chunk span, indexed `pos % size`) lets long prompts take the chunked batched-flash path instead of the ~8× slower per-token fallback (**2.72× — 47.7 → 129.6 t/s on a 5.3K-token prompt**, measured at a context that admits it, not the `-c 2048` of this row), and fixes long-context correctness past the 512 window (the cache was previously window-sized but indexed absolutely → out of bounds). **Decode (#142):** dp4a/Q8_1 int8 matvec + CUDA-graph replay. Fast paths argmax-stable vs fp32 (SoA repack + ring bit-identical below the window). Bisect env: `SHARPI_PREFILL_MMQ` / `_FLASH_TC` / `SHARPI_MMQ_SOA` / `SHARPI_Q80_DP4A` / `SHARPI_CUDA_GRAPH` / `SHARPI_BATCHED_PREFILL` (`=0`). Remaining gap to llama.cpp (~8475 prefill / ~78 decode): cp.async-pipelined MMQ + decode matvec |
 | Gemma 4 E4B-it Q8 | (same) | 8 GB | **CUDA** `-g 22 -c 2048` (hybrid) | 6.6 | 6.8 | 22 GPU + 20 CPU layers. `-g ≤ 22` required so the CPU shared-KV tail can read its own-KV source layers; CPU dense-FFN dominates decode (bandwidth-bound). `SHARPI_CUDA_PROFILE=1` for per-phase breakdown |
 
 _Numbers re-measured across every on-disk row at ~1K ctx so the prefill column is comparable; per-issue

diff --git a/src/SharpInference.Cuda/CudaTextKernels.cs b/src/SharpInference.Cuda/CudaTextKernels.cs
@@ -599,7 +599,10 @@ __device__ __forceinline__ unsigned int sharpi_uint_at(const unsigned int* __res
 {
     int i = (int)(blockIdx.x * blockDim.x + threadIdx.x);
     if (i >= kv_dim) return;
-    long offset = (long)position * (long)kv_dim + (long)i;
+    // Ring slot: `position % max_seq_len`. `max_seq_len` is the allocated cache size,
+    // so for a full-context (dense / global) cache `position < max_seq_len` makes this
+    // the identity; for a window-sized SWA ring it wraps the write into the ring.
+    long offset = (long)(position % max_seq_len) * (long)kv_dim + (long)i;
     k_cache[offset] = k_in[i];
     v_cache[offset] = v_in[i];
 }
@@ -618,7 +621,11 @@ __device__ __forceinline__ unsigned int sharpi_uint_at(const unsigned int* __res
 {
     int i = (int)(blockIdx.x * blockDim.x + threadIdx.x);
     if (i >= kv_dim) return;
-    long offset = (long)position * (long)kv_dim + (long)i;
+    // Ring slot `position % max_seq_len` (identity for a full-context cache; wraps a
+    // window-sized ring). Matches the f32 llm_kv_append so the write/read indexing stays
+    // uniform if a windowed model ever uses the bf16 KV cache (today only the full-context
+    // GDN-hybrid path does, where position < max_seq_len makes this the identity).
+    long offset = (long)(position % max_seq_len) * (long)kv_dim + (long)i;
     k_cache[offset] = (unsigned short)sharpi_fp32_to_bf16(k_in[i]);
     v_cache[offset] = (unsigned short)sharpi_fp32_to_bf16(v_in[i]);
 }
@@ -4066,7 +4073,9 @@ __device__ __forceinline__ void sharpi_q4k_scale_min(
     for (int t = (int)tid; t < eff_seq; t += 256) {
         int abs_t = t + window_start;
         float dot = 0.f;
-        long k_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
+        // Ring slot `abs_t % max_seq_len` (max_seq_len = allocated cache size): identity
+        // for a full cache, wraps a window-sized SWA ring. abs_t itself stays logical.
+        long k_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
         for (int d = 0; d < head_dim; d++)
             dot += q[q_off + d] * k_cache[k_off + d];
         float score = dot * scale;
@@ -4127,7 +4136,7 @@ __device__ __forceinline__ void sharpi_q4k_scale_min(
         for (int t = 0; t < eff_seq; t++) {
             int abs_t = t + window_start;
             float weight = use_shared ? shared_scores[t] : head_scratch[t];
-            long v_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
+            long v_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
             acc += weight * v_cache[v_off + d];
         }
         out[out_off + d] = acc;
@@ -4246,8 +4255,10 @@ __device__ __forceinline__ float fatc_mask(float s, int qpos, int abs_k, int win
         for (int idx = lane; idx < FATC_KT * head_dim; idx += 32) {
             int kk = idx / head_dim, d = idx - kk * head_dim;
             int abs_k = kt0 + kk;
+            // Cache read at ring slot `abs_k % max_seq_len` (identity for a full cache,
+            // wraps a window-sized SWA ring); abs_k stays logical for the causal bound.
             float kv = (abs_k < key_end)
-                ? k_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
+                ? k_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
             sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(kv);
         }
         __syncthreads();
@@ -4345,7 +4356,7 @@ asm volatile(
             int kk = idx / head_dim, d = idx - kk * head_dim;
             int abs_k = kt0 + kk;
             float vv = (abs_k < key_end)
-                ? v_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
+                ? v_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
             sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(vv);
         }
         __syncthreads();
@@ -4454,8 +4465,10 @@ asm volatile(
         for (int idx = tid; idx < FATC2_KT * head_dim; idx += FATC2_W * 32) {
             int kk = idx / head_dim, d = idx - kk * head_dim;
             int abs_k = kt0 + kk;
+            // Cache read at ring slot `abs_k % max_seq_len` (identity for a full cache,
+            // wraps a window-sized SWA ring); abs_k stays logical for the causal bound.
             float kv = (abs_k < key_end)
-                ? k_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
+                ? k_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
             sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(kv);
         }
         __syncthreads();
@@ -4566,7 +4579,7 @@ asm volatile(
             int kk = idx / head_dim, d = idx - kk * head_dim;
             int abs_k = kt0 + kk;
             float vv = (abs_k < key_end)
-                ? v_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
+                ? v_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
             sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(vv);
         }
         __syncthreads();
@@ -4697,7 +4710,9 @@ asm volatile(
             int kk = idx / hd2, pr = idx - kk * hd2;
             unsigned int kh = 0u;
             if (kk < tile_keys) {
-                long off = (long)(kt0 + kk) * kv_dim + (long)kv_head * head_dim + 2 * pr;
+                // Ring slot `(kt0+kk) % max_seq_len`: identity for a full cache, wraps a
+                // window-sized SWA ring. The kt0+kk index stays logical for tile bounds.
+                long off = (long)((kt0 + kk) % max_seq_len) * kv_dim + (long)kv_head * head_dim + 2 * pr;
                 kh = sharpi_f32x2_to_f16x2(k_cache[off], k_cache[off + 1]);
             }
             sKh[idx] = kh;
@@ -4706,7 +4721,7 @@ asm volatile(
         for (int idx = tid; idx < kt_tile * head_dim; idx += (int)blockDim.x) {
             int kk = idx / head_dim, d = idx - kk * head_dim;
             sV[idx] = (kk < tile_keys)
-                ? v_cache[(long)(kt0 + kk) * kv_dim + (long)kv_head * head_dim + d]
+                ? v_cache[(long)((kt0 + kk) % max_seq_len) * kv_dim + (long)kv_head * head_dim + d]
                 : 0.f;
         }
         __syncthreads();
@@ -4799,7 +4814,9 @@ asm volatile(
     for (int t = (int)tid; t < eff_seq; t += 256) {
         int abs_t = t + window_start;
         float dot = 0.f;
-        long k_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
+        // Ring slot `abs_t % max_seq_len` (max_seq_len = allocated cache size): identity
+        // for a full cache, wraps a window-sized SWA ring. abs_t itself stays logical.
+        long k_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
         for (int dd = 0; dd < head_dim; dd++)
             dot += q[q_off + dd] * k_cache[k_off + dd];
         shared_scores[t] = dot * scale;
@@ -4841,7 +4858,7 @@ asm volatile(
         float acc = 0.f;
         for (int t = 0; t < eff_seq; t++) {
             int abs_t = t + window_start;
-            long v_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
+            long v_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
             acc += shared_scores[t] * v_cache[v_off + dd];
         }
         out[out_off + dd] = acc;
@@ -5616,7 +5633,9 @@ asm volatile(
     int e = (int)(blockIdx.x * blockDim.x + threadIdx.x);
     int i = (int)blockIdx.y;
     if (e >= kv_dim || i >= n_tok) return;
-    long off = (long)(start_pos + i) * (long)kv_dim + (long)e;
+    // Ring slot `(start_pos+i) % max_seq_len`: identity for a full-context cache
+    // (position < max_seq_len), wraps into a window-sized SWA ring otherwise.
+    long off = (long)((start_pos + i) % max_seq_len) * (long)kv_dim + (long)e;
     k_cache[off] = k_all[(long)i * kv_dim + e];
     v_cache[off] = v_all[(long)i * kv_dim + e];
 }
@@ -5630,7 +5649,9 @@ asm volatile(
     int e = (int)(blockIdx.x * blockDim.x + threadIdx.x);
     int i = (int)blockIdx.y;
     if (e >= kv_dim || i >= n_tok) return;
-    long off = (long)(start_pos + i) * (long)kv_dim + (long)e;
+    // Ring slot `(start_pos+i) % max_seq_len` (identity for a full-context cache; wraps a
+    // window-sized ring) — kept in lockstep with the f32 llm_kv_append_batched.
+    long off = (long)((start_pos + i) % max_seq_len) * (long)kv_dim + (long)e;
     k_cache[off] = (unsigned short)sharpi_fp32_to_bf16(k_all[(long)i * kv_dim + e]);
     v_cache[off] = (unsigned short)sharpi_fp32_to_bf16(v_all[(long)i * kv_dim + e]);
 }