Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@ coherent (`scripts/bench-all.ps1`); top-1 parity vs llama.cpp b8585 verified on
| Qwen3.6-35B-A3B-MTP (GDN+MoE) | (same) | 22 GB | **CUDA** `-g -1 --no-thinking` (hybrid) | **65.0** | **22.9** | Requires `SHARPI_CPU_MOE=1`: 30 GDN + 10 attn + shared expert on GPU, routed experts CPU mmap. 100% acceptance. Fused GDN scan + batched SDPA (#114-B/#118), bit-identical, grows with ctx |
| Carnice (Qwen3.6-35B-A3B-MTP finetune) | [mudler](https://huggingface.co/mudler/Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP-GGUF) | 17 GB | **CUDA** `-g -1 --no-thinking` (hybrid) | **43.6** | **25.0** | agentic finetune of 35B-A3B-MTP; 77% acceptance (`bench-carnice.ps1` — the default prompt 1-token-EOSes on this terser tune). APEX mixed-precision (Q3_K + Q8_0 experts); Q8_KS per-32 int dots auto-enable at load (#99/#101/#107), +4.6% decode at ~4× tighter parity vs plain Q8_K (`SHARPI_Q3K_Q8K=0`/`SHARPI_Q8_0_Q8K=0` to disable). Fused GDN scan + wave SDPA (#114-B/#118) bit-identical past 4096 |
| Gemma 4 E4B-it Q8 | [unsloth](https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF) | 8 GB | CPU | 4.9 | 5.0 | dense 42-layer gemma4: per-layer head_dim (256 SWA / 512 global), dual-RoPE, KV-share tail (18 layers), 5:1 SWA:global, logit softcap 30, PLE-256 injection (~4.2 GB mmap-resident) |
| Gemma 4 E4B-it Q8 | (same) | 8 GB | **CUDA** `-g -1 -c 2048` | **3698** | **59** | all 42 layers fit at `-c 2048`. KV-share alias + SWA/global split per layer; PLE projections (~215 MB) upload at construction. **Prefill (#141):** int8 **tensor-core MMQ** matmul (`mma.m16n8k32.s8`, each Q8_0 weight read once as int8 — beats the dequant→fp16→cuBLAS GEMM, drops its fp16 HBM temp; `SHARPI_PREFILL_MMQ=0` reverts) + a **tensor-core flash-attention** prefill (#146/#147): both QK^T and P·V on the mma cores (`mma.m16n8k16.f16`), multi-warp **d-split** so the O tile stays register-resident — replaces the scalar O(n²) per-query attention (which re-streamed each query's K/V window up to ~512×) and is **+27% at ~1K / +40% at 1.8K** over the earlier half2 flash kernel (`SHARPI_PREFILL_FLASH_TC=0` reverts to half2, `=…_FLASH=0` to scalar) + a **SoA Q8_0 weight repack** (#149): all Q8_0 readers (MMQ, dp4a, fp32 matvec, GEMM-N, dequant) read the quants 16-byte-aligned with the fp16 scales split out, killing the `qs` 2-byte-misalignment funnelshift tax — **+10-12% prefill, bit-identical**; `SHARPI_MMQ_SOA=0` reverts) + a batched Q8_0 embedding lookup. **109→3698 at ~1K ctx, →4240 at 1.8K** — profiling showed *attention*, then the matmul inner-loop efficiency, were the dominant prefill costs at realistic prompt lengths. **Decode (#142):** dp4a/Q8_1 int8 matvec (`SHARPI_Q80_DP4A=0` to bisect) + CUDA-graph capture/replay default-on (`SHARPI_CUDA_GRAPH=0` to bisect). All prefill/decode fast paths are argmax-stable vs the fp32 path, not bit-exact (the SoA repack is bit-identical). Remaining gap to llama.cpp (~8475 prefill / ~78 decode): cp.async-pipelined MMQ on the SoA layout + decode matvec work |
| Gemma 4 E4B-it Q8 | (same) | 8 GB | **CUDA** `-g -1 -c 2048` | **3698** | **59** | all 42 layers fit at `-c 2048`; KV-share alias + per-layer SWA/global split; PLE projections (~215 MB) upload at construction. **Prefill (#141/#146/#147/#149):** int8 tensor-core MMQ (`mma.m16n8k32.s8`, weight read once as int8) + tensor-core flash attention (QK^T and P·V on mma cores, register-resident O via d-split) + a SoA Q8_0 weight repack (16-byte-aligned quants) + batched embedding lookup — **109 → 3698 t/s @1K, 4240 @1.8K**. **>4096 prompts (#162/#164):** a real SWA KV ring (cache sized window + one chunk span, indexed `pos % size`) lets long prompts take the chunked batched-flash path instead of the ~8× slower per-token fallback (**2.72× — 47.7 → 129.6 t/s on a 5.3K-token prompt**, measured at a context that admits it, not the `-c 2048` of this row), and fixes long-context correctness past the 512 window (the cache was previously window-sized but indexed absolutely → out of bounds). **Decode (#142):** dp4a/Q8_1 int8 matvec + CUDA-graph replay. Fast paths argmax-stable vs fp32 (SoA repack + ring bit-identical below the window). Bisect env: `SHARPI_PREFILL_MMQ` / `_FLASH_TC` / `SHARPI_MMQ_SOA` / `SHARPI_Q80_DP4A` / `SHARPI_CUDA_GRAPH` / `SHARPI_BATCHED_PREFILL` (`=0`). Remaining gap to llama.cpp (~8475 prefill / ~78 decode): cp.async-pipelined MMQ + decode matvec |
| Gemma 4 E4B-it Q8 | (same) | 8 GB | **CUDA** `-g 22 -c 2048` (hybrid) | 6.6 | 6.8 | 22 GPU + 20 CPU layers. `-g ≤ 22` required so the CPU shared-KV tail can read its own-KV source layers; CPU dense-FFN dominates decode (bandwidth-bound). `SHARPI_CUDA_PROFILE=1` for per-phase breakdown |

_Numbers re-measured across every on-disk row at ~1K ctx so the prefill column is comparable; per-issue
Expand Down
49 changes: 35 additions & 14 deletions src/SharpInference.Cuda/CudaTextKernels.cs
Original file line number Diff line number Diff line change
Expand Up @@ -599,7 +599,10 @@ __device__ __forceinline__ unsigned int sharpi_uint_at(const unsigned int* __res
{
int i = (int)(blockIdx.x * blockDim.x + threadIdx.x);
if (i >= kv_dim) return;
long offset = (long)position * (long)kv_dim + (long)i;
// Ring slot: `position % max_seq_len`. `max_seq_len` is the allocated cache size,
// so for a full-context (dense / global) cache `position < max_seq_len` makes this
// the identity; for a window-sized SWA ring it wraps the write into the ring.
long offset = (long)(position % max_seq_len) * (long)kv_dim + (long)i;
k_cache[offset] = k_in[i];
v_cache[offset] = v_in[i];
}
Expand All @@ -618,7 +621,11 @@ __device__ __forceinline__ unsigned int sharpi_uint_at(const unsigned int* __res
{
int i = (int)(blockIdx.x * blockDim.x + threadIdx.x);
if (i >= kv_dim) return;
long offset = (long)position * (long)kv_dim + (long)i;
// Ring slot `position % max_seq_len` (identity for a full-context cache; wraps a
// window-sized ring). Matches the f32 llm_kv_append so the write/read indexing stays
// uniform if a windowed model ever uses the bf16 KV cache (today only the full-context
// GDN-hybrid path does, where position < max_seq_len makes this the identity).
long offset = (long)(position % max_seq_len) * (long)kv_dim + (long)i;
k_cache[offset] = (unsigned short)sharpi_fp32_to_bf16(k_in[i]);
v_cache[offset] = (unsigned short)sharpi_fp32_to_bf16(v_in[i]);
}
Expand Down Expand Up @@ -4066,7 +4073,9 @@ __device__ __forceinline__ void sharpi_q4k_scale_min(
for (int t = (int)tid; t < eff_seq; t += 256) {
int abs_t = t + window_start;
float dot = 0.f;
long k_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
// Ring slot `abs_t % max_seq_len` (max_seq_len = allocated cache size): identity
// for a full cache, wraps a window-sized SWA ring. abs_t itself stays logical.
long k_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
for (int d = 0; d < head_dim; d++)
dot += q[q_off + d] * k_cache[k_off + d];
float score = dot * scale;
Expand Down Expand Up @@ -4127,7 +4136,7 @@ __device__ __forceinline__ void sharpi_q4k_scale_min(
for (int t = 0; t < eff_seq; t++) {
int abs_t = t + window_start;
float weight = use_shared ? shared_scores[t] : head_scratch[t];
long v_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
long v_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
acc += weight * v_cache[v_off + d];
}
out[out_off + d] = acc;
Expand Down Expand Up @@ -4246,8 +4255,10 @@ __device__ __forceinline__ float fatc_mask(float s, int qpos, int abs_k, int win
for (int idx = lane; idx < FATC_KT * head_dim; idx += 32) {
int kk = idx / head_dim, d = idx - kk * head_dim;
int abs_k = kt0 + kk;
// Cache read at ring slot `abs_k % max_seq_len` (identity for a full cache,
// wraps a window-sized SWA ring); abs_k stays logical for the causal bound.
float kv = (abs_k < key_end)
? k_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
? k_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(kv);
}
__syncthreads();
Expand Down Expand Up @@ -4345,7 +4356,7 @@ asm volatile(
int kk = idx / head_dim, d = idx - kk * head_dim;
int abs_k = kt0 + kk;
float vv = (abs_k < key_end)
? v_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
? v_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(vv);
}
__syncthreads();
Expand Down Expand Up @@ -4454,8 +4465,10 @@ asm volatile(
for (int idx = tid; idx < FATC2_KT * head_dim; idx += FATC2_W * 32) {
int kk = idx / head_dim, d = idx - kk * head_dim;
int abs_k = kt0 + kk;
// Cache read at ring slot `abs_k % max_seq_len` (identity for a full cache,
// wraps a window-sized SWA ring); abs_k stays logical for the causal bound.
float kv = (abs_k < key_end)
? k_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
? k_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(kv);
}
__syncthreads();
Expand Down Expand Up @@ -4566,7 +4579,7 @@ asm volatile(
int kk = idx / head_dim, d = idx - kk * head_dim;
int abs_k = kt0 + kk;
float vv = (abs_k < key_end)
? v_cache[(long)abs_k * kv_dim + (long)kv_head * head_dim + d] : 0.f;
? v_cache[(long)(abs_k % max_seq_len) * kv_dim + (long)kv_head * head_dim + d] : 0.f;
sKV[idx] = (unsigned short)sharpi_fp32_to_fp16(vv);
}
__syncthreads();
Expand Down Expand Up @@ -4697,7 +4710,9 @@ asm volatile(
int kk = idx / hd2, pr = idx - kk * hd2;
unsigned int kh = 0u;
if (kk < tile_keys) {
long off = (long)(kt0 + kk) * kv_dim + (long)kv_head * head_dim + 2 * pr;
// Ring slot `(kt0+kk) % max_seq_len`: identity for a full cache, wraps a
// window-sized SWA ring. The kt0+kk index stays logical for tile bounds.
long off = (long)((kt0 + kk) % max_seq_len) * kv_dim + (long)kv_head * head_dim + 2 * pr;
kh = sharpi_f32x2_to_f16x2(k_cache[off], k_cache[off + 1]);
}
sKh[idx] = kh;
Expand All @@ -4706,7 +4721,7 @@ asm volatile(
for (int idx = tid; idx < kt_tile * head_dim; idx += (int)blockDim.x) {
int kk = idx / head_dim, d = idx - kk * head_dim;
sV[idx] = (kk < tile_keys)
? v_cache[(long)(kt0 + kk) * kv_dim + (long)kv_head * head_dim + d]
? v_cache[(long)((kt0 + kk) % max_seq_len) * kv_dim + (long)kv_head * head_dim + d]
: 0.f;
}
__syncthreads();
Expand Down Expand Up @@ -4799,7 +4814,9 @@ asm volatile(
for (int t = (int)tid; t < eff_seq; t += 256) {
int abs_t = t + window_start;
float dot = 0.f;
long k_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
// Ring slot `abs_t % max_seq_len` (max_seq_len = allocated cache size): identity
// for a full cache, wraps a window-sized SWA ring. abs_t itself stays logical.
long k_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
for (int dd = 0; dd < head_dim; dd++)
dot += q[q_off + dd] * k_cache[k_off + dd];
shared_scores[t] = dot * scale;
Expand Down Expand Up @@ -4841,7 +4858,7 @@ asm volatile(
float acc = 0.f;
for (int t = 0; t < eff_seq; t++) {
int abs_t = t + window_start;
long v_off = (long)abs_t * (long)kv_dim + (long)kv_head * (long)head_dim;
long v_off = (long)(abs_t % max_seq_len) * (long)kv_dim + (long)kv_head * (long)head_dim;
acc += shared_scores[t] * v_cache[v_off + dd];
}
out[out_off + dd] = acc;
Expand Down Expand Up @@ -5616,7 +5633,9 @@ asm volatile(
int e = (int)(blockIdx.x * blockDim.x + threadIdx.x);
int i = (int)blockIdx.y;
if (e >= kv_dim || i >= n_tok) return;
long off = (long)(start_pos + i) * (long)kv_dim + (long)e;
// Ring slot `(start_pos+i) % max_seq_len`: identity for a full-context cache
// (position < max_seq_len), wraps into a window-sized SWA ring otherwise.
long off = (long)((start_pos + i) % max_seq_len) * (long)kv_dim + (long)e;
k_cache[off] = k_all[(long)i * kv_dim + e];
v_cache[off] = v_all[(long)i * kv_dim + e];
}
Expand All @@ -5630,7 +5649,9 @@ asm volatile(
int e = (int)(blockIdx.x * blockDim.x + threadIdx.x);
int i = (int)blockIdx.y;
if (e >= kv_dim || i >= n_tok) return;
long off = (long)(start_pos + i) * (long)kv_dim + (long)e;
// Ring slot `(start_pos+i) % max_seq_len` (identity for a full-context cache; wraps a
// window-sized ring) — kept in lockstep with the f32 llm_kv_append_batched.
long off = (long)((start_pos + i) % max_seq_len) * (long)kv_dim + (long)e;
k_cache[off] = (unsigned short)sharpi_fp32_to_bf16(k_all[(long)i * kv_dim + e]);
v_cache[off] = (unsigned short)sharpi_fp32_to_bf16(v_all[(long)i * kv_dim + e]);
}
Expand Down
Loading