
perf(cuda): apply PR #143 review micro-opts (16-bit scale loads, shift/mask staging, overflow-check ordering) #145

Description

@pekkah

An automated review (Gemini code-assist) on PR #143 left 7 medium-priority perf/correctness micro-opts. They are non-blocking (PR #143 is tested, CI-green, and review-toolkit-clean), so they are deferred to this issue rather than destabilizing the merge.

CUDA kernels (src/SharpInference.Cuda/CudaTextKernels.cs):

  • llm_flash_attn_prefill_f32 K-staging loop (~L3137): replace the divide idx / hd2 and the remainder idx - kk*hd2 with a shift and a mask. Caveat: this bakes a power-of-two head_dim assumption into an otherwise-general kernel. Current callers (Gemma 4: 256/512) are all power-of-two, but guard or document the assumption before applying.
  • Same kernel, V-staging loop (~L3147): shift/mask on head_dim. Same caveat.
  • llm_matvec_q8_0 scale load (~L1342): use a single 16-bit ld.global.u16 instead of two sharpi_byte_at calls (b0 is always 2-byte aligned because the block stride, 34 bytes, is even).
  • MMQ_LOAD_TILE macro scale load (~L1456): same single 16-bit load.
  • llm_dequant_q8_0_to_f16 scale load (~L1594): same.
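
The shift/mask caveat from the two staging-loop items can be sketched on the host side as below. This is a minimal C++ illustration, not the kernel code: `is_pow2`, `log2_pow2`, `RowCol`, `split_divmod`, and `split_shift_mask` are hypothetical names, and `hd2` is assumed to be the staging row width derived from head_dim (only the name comes from the kernel).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when n is a nonzero power of two.
constexpr bool is_pow2(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Hypothetical helper: log2(n) for a power-of-two n.
constexpr uint32_t log2_pow2(uint32_t n) {
    uint32_t shift = 0;
    while ((n >> shift) > 1) ++shift;
    return shift;
}

struct RowCol {
    uint32_t row;  // kk in the issue text
    uint32_t col;  // idx - kk*hd2 in the issue text
};

// General form: works for any hd2 > 0.
inline RowCol split_divmod(uint32_t idx, uint32_t hd2) {
    const uint32_t kk = idx / hd2;
    return {kk, idx - kk * hd2};
}

// Shift/mask form: only equivalent to the above when hd2 is a power of two,
// so the assumption is guarded rather than silently baked in.
inline RowCol split_shift_mask(uint32_t idx, uint32_t hd2) {
    assert(is_pow2(hd2) && "shift/mask staging requires power-of-two hd2");
    return {idx >> log2_pow2(hd2), idx & (hd2 - 1)};
}
```

In a real kernel the shift and mask would presumably be computed once outside the staging loop (or passed in by the launcher), with the power-of-two check done on the host before launch.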

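For the three Q8_0 scale-load items, the equivalence between two byte loads and one 16-bit load can be sketched as follows. This is a host-side C++ stand-in under stated assumptions: a 34-byte block whose first two bytes hold the fp16 scale bits, a 2-byte-aligned base pointer, and a little-endian target (as on CUDA devices). The function names are hypothetical and do not reproduce sharpi_byte_at.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed Q8_0 block stride from the issue: 2 scale bytes + 32 quant bytes.
constexpr std::size_t kQ8_0BlockBytes = 34;

// Hypothetical helper: true when p sits on a 2-byte boundary.
inline bool is_2byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 2 == 0;
}

// Byte-wise form: two 8-bit loads combined in little-endian order.
inline uint16_t scale_bits_bytewise(const uint8_t* base, std::size_t block) {
    const uint8_t* b0 = base + block * kQ8_0BlockBytes;
    return static_cast<uint16_t>(b0[0] | (b0[1] << 8));
}

// Single 16-bit load. memcpy keeps the host sketch free of aliasing issues;
// a device kernel would use an aligned 16-bit global load instead.
inline uint16_t scale_bits_u16(const uint8_t* base, std::size_t block) {
    const uint8_t* b0 = base + block * kQ8_0BlockBytes;
    uint16_t v;
    std::memcpy(&v, b0, sizeof v);
    return v;
}
```

Because 34 is even, every block start stays 2-byte aligned as long as the base pointer is; if a tensor could ever be bound at an odd byte offset, the single-load form would need a guard.
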
C# (src/SharpInference.Cuda/CudaBackend.cs, src/SharpInference.Engine/CudaForwardPass.cs):

  • MatMulBatchedMmq (~CudaBackend L1827): compute (long)cols*nTok and overflow-check before the (int) cast (currently truncates first, then checks).
  • EmbedLookupQ8_0Batched token upload (~CudaForwardPass L1826): use CollectionsMarshal.AsSpan for the List<int> path to avoid the ToArray() allocation.
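
The overflow-check ordering in the MatMulBatchedMmq item can be illustrated with a C++ analog of the C# arithmetic. This is a sketch only: `truncated_elem_count` and `checked_elem_count` are hypothetical names, and the exact shape of the existing check in CudaBackend.cs is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Cast-then-check ordering: the 64-bit product is truncated to 32 bits
// first, so any later range check only sees the wrapped value.
inline int truncated_elem_count(int cols, int nTok) {
    const int64_t wide = static_cast<int64_t>(cols) * nTok;
    return static_cast<int>(static_cast<uint32_t>(wide));
}

// Check-then-cast ordering: validate the 64-bit product, then narrow.
inline int checked_elem_count(int cols, int nTok) {
    const int64_t wide = static_cast<int64_t>(cols) * nTok;
    if (wide < 0 || wide > std::numeric_limits<int>::max()) {
        throw std::overflow_error("cols * nTok exceeds int range");
    }
    return static_cast<int>(wide);
}
```

With cols = nTok = 65536 the product is 2^32, which truncates to 0 and would pass a sign-based check on the narrowed value, while the check-then-cast form throws.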

All are perf/cleanliness items; the only correctness-flavored one is the MMQ overflow-check ordering, which is cosmetic in practice because cols*nTok can't realistically exceed int range for current models. Each needs a managed rebuild and a passing run of the focused Gemma4Cuda / CudaMmqQ8_0Tests / CudaFlashAttnTests tests.

Source: PR #143 review thread.
