perf(cuda): apply PR #143 review micro-opts (16-bit scale loads, shift/mask staging, overflow-check ordering)

Automated review (Gemini code-assist) on PR #143 left 7 medium-priority perf/correctness micro-opts. They are non-blocking: PR #143 is tested, CI-green, and review-toolkit-clean, so they are deferred to this issue rather than destabilizing the merge.

**CUDA kernels (`src/SharpInference.Cuda/CudaTextKernels.cs`):**
- [ ] `llm_flash_attn_prefill_f32` K-staging loop (~L3137): replace `idx / hd2` + `idx - kk*hd2` with shift/mask. **Caveat:** bakes a power-of-two `head_dim` assumption into an otherwise-general kernel. Current callers (Gemma 4: 256/512) are all power-of-two, but guard or document before applying.
- [ ] Same kernel, V-staging loop (~L3147): shift/mask on `head_dim`. Same caveat.
- [ ] `llm_matvec_q8_0` scale load (~L1342): use a single 16-bit `ld.global.u16` instead of two `sharpi_byte_at` calls. This is safe because the block offset `b0` is always 2-byte aligned: the Q8_0 block stride is 34 bytes, which is even.
- [ ] `MMQ_LOAD_TILE` macro scale load (~L1456): same single 16-bit load.
- [ ] `llm_dequant_q8_0_to_f16` scale load (~L1594): same single 16-bit load.
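
The two kernel-side rewrites above can be sketched on the host in plain C++ to check that they are equivalent to the current forms. This is only an illustration under stated assumptions, not the kernel code: `split_divmod`, `split_shift_mask`, `scale_bits_bytes`, and `scale_bits_u16` are hypothetical names, `width` stands in for `hd2`, the two single-byte reads are assumed to be combined little-endian, and the host is assumed to be little-endian like the GPU.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True when n is a nonzero power of two; the shift/mask rewrite is only valid then.
constexpr bool is_pow2(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

// log2 of a power of two (hypothetical helper; a kernel would more likely
// receive the shift amount as a precomputed argument).
constexpr uint32_t log2_pow2(uint32_t n) {
    uint32_t s = 0;
    while ((1u << s) < n) ++s;
    return s;
}

struct RowCol { uint32_t row, col; };

// General form (idx / width, idx - row * width): correct for any width > 0.
inline RowCol split_divmod(uint32_t idx, uint32_t width) {
    const uint32_t row = idx / width;
    return {row, idx - row * width};
}

// Shift/mask form: same result, but only when width is a power of two.
inline RowCol split_shift_mask(uint32_t idx, uint32_t width) {
    assert(is_pow2(width));  // the guard the caveat asks for
    return {idx >> log2_pow2(width), idx & (width - 1)};
}

// Current style: two single-byte reads combined into the 16-bit scale
// (assumed little-endian; stand-in for two sharpi_byte_at calls).
inline uint16_t scale_bits_bytes(const uint8_t* block) {
    return static_cast<uint16_t>(block[0] | (block[1] << 8));
}

// Proposed style: one 16-bit read. memcpy keeps this well-defined on the host;
// on the device the counterpart is one aligned 16-bit load, which is legal
// because every block starts at an even byte offset (block_index * 34).
inline uint16_t scale_bits_u16(const uint8_t* block) {
    uint16_t v;
    std::memcpy(&v, block, sizeof v);
    return v;
}
```

If a non-power-of-two `head_dim` ever reaches the kernel, the shift/mask form silently produces wrong indices, which is why the assertion (or an equivalent launch-time check) belongs next to the rewrite.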

**C# (`src/SharpInference.Cuda/CudaBackend.cs`, `src/SharpInference.Engine/CudaForwardPass.cs`):**
- [ ] `MatMulBatchedMmq` (~CudaBackend L1827): compute `(long)cols*nTok` and overflow-check *before* the `(int)` cast (currently truncates first, then checks).
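- [ ] `EmbedLookupQ8_0Batched` token upload (~CudaForwardPass L1826): use `CollectionsMarshal.AsSpan` for the `List<int>` path to avoid the `ToArray()` allocation. The span is only valid while the list is not resized, so it must be consumed before any further adds or removes.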
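
The overflow-check ordering for `MatMulBatchedMmq` can be illustrated with a small C++ analogue of the C# logic. This is a sketch under assumptions, not the backend code: `checked_element_count` is a hypothetical helper, and the exact check used in `CudaBackend.cs` may differ. The point is that the product is formed in 64 bits and range-checked before it is narrowed, which mirrors `(long)cols*nTok` in C#, where the cast binds tighter than the multiplication.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: writes cols * nTok to *out and returns true when the
// product fits in a 32-bit int; returns false (leaving *out untouched) otherwise.
inline bool checked_element_count(int32_t cols, int32_t nTok, int32_t* out) {
    // Widen one operand first so the multiply itself happens in 64 bits.
    const int64_t total = static_cast<int64_t>(cols) * nTok;

    // Range-check the 64-bit value BEFORE narrowing. Checking after the cast
    // is too late: 65536 * 65536 truncates to 0, which looks like a valid count.
    if (total < 0 || total > std::numeric_limits<int32_t>::max()) {
        return false;
    }

    *out = static_cast<int32_t>(total);
    return true;
}
```

A caller would treat a `false` return as the error path, for example by throwing or by falling back to a path that handles 64-bit sizes.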

All are perf/cleanliness changes. The only correctness-flavored one is the MMQ overflow-check ordering, and it is cosmetic in practice because `cols*nTok` can't realistically exceed `int` range for current models. Each change needs a managed rebuild followed by a passing run of the focused `Gemma4Cuda` / `CudaMmqQ8_0Tests` / `CudaFlashAttnTests` suites.

Source: PR #143 review thread.