CUDA: batch gate/up/down uploads for selected expert cache misses by fmolara · Pull Request #460 · antirez/ds4

fmolara · 2026-06-26T12:58:36Z

Motivation

In the CUDA direct-I/O path, a selected expert cache miss uploads the gate/up/down tensors independently, synchronizing the selected upload stream after each tensor.

This patch batches the three uploads belonging to the same expert and synchronizes the selected upload stream once after all three have been enqueued.
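
The difference in synchronization points can be illustrated with a small host-only model in C. This is a sketch, not the patch itself: `fake_stream`, `enqueue_upload`, `sync_stream`, `expert_tensors` and both `upload_expert_*` functions are hypothetical names introduced here, and the stream is simulated with a plain queue so the example runs without a GPU. In a real CUDA build the enqueue and sync steps would presumably map to calls such as `cudaMemcpyAsync` and `cudaStreamSynchronize`, but the source does not show the actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-only stand-in for a CUDA upload stream. */
#define MAX_PENDING 8

typedef struct {
    void *dst;
    const void *src;
    size_t n;
} pending_copy;

typedef struct {
    pending_copy pending[MAX_PENDING];
    int n_pending;
    int sync_count; /* number of times the host blocked on the stream */
} fake_stream;

/* Models an asynchronous copy: queue the work and return immediately.
 * Returns 0 on success, -1 if the queue is full. */
static int enqueue_upload(fake_stream *s, void *dst, const void *src, size_t n) {
    if (s->n_pending >= MAX_PENDING) return -1;
    s->pending[s->n_pending++] = (pending_copy){dst, src, n};
    return 0;
}

/* Models a stream synchronize: complete all queued copies and
 * record one host-side stall. */
static void sync_stream(fake_stream *s) {
    assert(s->n_pending <= MAX_PENDING);
    for (int i = 0; i < s->n_pending; i++)
        memcpy(s->pending[i].dst, s->pending[i].src, s->pending[i].n);
    s->n_pending = 0;
    s->sync_count++;
}

/* One expert's three tensors, in gate/up/down order. */
typedef struct {
    const void *src[3];
    void *dst[3];
    size_t n[3];
} expert_tensors;

/* Behaviour described as the existing path: sync after each tensor. */
static int upload_expert_per_tensor(fake_stream *s, const expert_tensors *e) {
    for (int i = 0; i < 3; i++) {
        if (enqueue_upload(s, e->dst[i], e->src[i], e->n[i]) != 0) return -1;
        sync_stream(s);
    }
    return 0;
}

/* Batched variant: enqueue all three tensors, then sync once. */
static int upload_expert_batched(fake_stream *s, const expert_tensors *e) {
    for (int i = 0; i < 3; i++) {
        if (enqueue_upload(s, e->dst[i], e->src[i], e->n[i]) != 0) {
            sync_stream(s); /* drain whatever was queued before failing */
            return -1;
        }
    }
    sync_stream(s);
    return 0;
}
```

In this model both variants leave identical bytes in the destination buffers; the per-tensor variant stalls the host three times per expert miss and the batched variant once. Whether that translates into measurable throughput depends on the real workload and hardware.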

Properties

no API changes
no kernel changes
no cache policy changes
resident and compact cache layouts are unchanged
unsupported cases fall back to the existing path

Validation

A100 CUDA build: pass
smoke test: pass
deterministic output sanity check: identical visible output
local A100 80GB PCIe SSD-streaming tests showed a small repeatable throughput improvement on the tested workload

When a selected expert cache miss occurs in the direct-I/O CUDA path, gate/up/down tensors are currently uploaded independently, synchronizing the selected upload stream after each tensor. Batch the three uploads for one expert and synchronize the upload stream once after all three have been enqueued. Layouts, cache policy and kernels are unchanged. Unsupported cases transparently fall back to the existing path.

fmolara wants to merge 1 commit into antirez:main from fmolara:upstream/batch-expert-miss-uploads
