Automated review (Gemini code-assist) on PR #143 left 7 medium-priority perf/correctness micro-opts. They are non-blocking — PR #143 is tested, CI-green, and review-toolkit-clean — so deferring them here rather than destabilizing the merge.
CUDA kernels (`src/SharpInference.Cuda/CudaTextKernels.cs`):

- `llm_flash_attn_prefill_f32` K-staging loop (~L3137): replace `idx / hd2` + `idx - kk*hd2` with shift/mask. Caveat: bakes a power-of-two `head_dim` assumption into an otherwise-general kernel. Current callers (Gemma 4: 256/512) are all power-of-two, but guard or document before applying.
- […] `head_dim`. Same caveat.
- `llm_matvec_q8_0` scale load (~L1342): single 16-bit `ld.global.u16` instead of two `sharpi_byte_at` (b0 is always 2-byte aligned, stride 34).
- `MMQ_LOAD_TILE` macro scale load (~L1456): same single-16-bit-load.
- `llm_dequant_q8_0_to_f16` scale load (~L1594): same.

C# (`src/SharpInference.Cuda/CudaBackend.cs`, `src/SharpInference.Engine/CudaForwardPass.cs`):

- `MatMulBatchedMmq` (~CudaBackend L1827): compute `(long)cols*nTok` and overflow-check before the `(int)` cast (currently truncates first, then checks).
- `EmbedLookupQ8_0Batched` token upload (~CudaForwardPass L1826): use `CollectionsMarshal.AsSpan` for the `List<int>` path to avoid the `ToArray()` allocation.

All are perf/cleanliness; the only correctness-flavored one is the MMQ overflow-check ordering (cosmetic in practice, since `cols*nTok` can't realistically exceed int range for current models). Each needs a managed rebuild + the focused `Gemma4Cuda` / `CudaMmqQ8_0Tests` / `CudaFlashAttnTests` pass.

Source: PR #143 review thread.
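
Two of the items listed here come down to small arithmetic details: the shift/mask replacement with its power-of-two caveat, and the widen-then-check ordering for `cols*nTok`. Below is a minimal standalone sketch of both, written in plain C++ for illustration only; the real code lives in CUDA and C#. The names `is_pow2`, `log2_pow2`, `split_index_pow2`, `RowCol`, and `checked_elem_count` are hypothetical and do not come from the repository, and the reading of `hd2` as the divisor in the K-staging index split is an assumption.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// True when v is a nonzero power of two.
constexpr bool is_pow2(uint32_t v) { return v != 0 && (v & (v - 1u)) == 0; }

// log2 of a power-of-two value. Callers must check is_pow2(v) first.
constexpr uint32_t log2_pow2(uint32_t v) {
    uint32_t s = 0;
    while ((v >> s) != 1u) ++s;
    return s;
}

struct RowCol {
    uint32_t row;  // corresponds to idx / hd2
    uint32_t col;  // corresponds to idx - row * hd2
};

// Shift/mask replacement for the division and remainder.
// The guard makes the power-of-two assumption explicit instead of silent.
RowCol split_index_pow2(uint32_t idx, uint32_t hd2) {
    if (!is_pow2(hd2)) {
        throw std::invalid_argument("hd2 must be a power of two");
    }
    const uint32_t shift = log2_pow2(hd2);
    return RowCol{idx >> shift, idx & (hd2 - 1u)};
}

// Widen to 64 bits, range-check, and only then narrow.
// Narrowing first would discard the high bits before the check sees them.
int32_t checked_elem_count(int32_t cols, int32_t n_tok) {
    const int64_t total = static_cast<int64_t>(cols) * static_cast<int64_t>(n_tok);
    if (total < 0 || total > std::numeric_limits<int32_t>::max()) {
        throw std::overflow_error("cols * n_tok exceeds int range");
    }
    return static_cast<int32_t>(total);
}
```

In a kernel, the power-of-two guard would more plausibly sit on the host side before launch (or be a documented precondition), so the device loop itself stays branch-free.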