CUDA: batch gate/up/down uploads for selected expert cache misses #460

Open
fmolara wants to merge 1 commit into antirez:main from fmolara:upstream/batch-expert-miss-uploads

Conversation

fmolara commented Jun 26, 2026

Motivation

In the CUDA direct-I/O path, a selected expert cache miss uploads the gate/up/down tensors independently, synchronizing the selected upload stream after each tensor.

This patch batches the three uploads belonging to the same expert and synchronizes the selected upload stream once after all three have been enqueued, so a miss costs one stream synchronization instead of three.
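For readers who have not seen the code path, the difference can be sketched roughly as follows. This is a minimal illustration, not the code in this patch: `TensorUpload`, `upload_expert_per_tensor`, `upload_expert_batched` and the `CUDA_CHECK` macro are hypothetical names, and the sketch assumes the host buffers are pinned so that `cudaMemcpyAsync` can actually run asynchronously. Both functions return the number of stream synchronizations they performed.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical descriptor for one tensor (gate, up or down) of an expert.
struct TensorUpload {
    const void *host_src;  // pinned host staging buffer (assumption)
    void       *dev_dst;   // destination slot in the device-side cache
    size_t      bytes;
};

// Per-tensor behaviour: enqueue one copy, then wait for it, n times.
static int upload_expert_per_tensor(const TensorUpload *t, int n,
                                    cudaStream_t stream) {
    assert(n > 0);
    int syncs = 0;
    for (int i = 0; i < n; i++) {
        CUDA_CHECK(cudaMemcpyAsync(t[i].dev_dst, t[i].host_src, t[i].bytes,
                                   cudaMemcpyHostToDevice, stream));
        CUDA_CHECK(cudaStreamSynchronize(stream));
        syncs++;
    }
    return syncs;
}

// Batched behaviour: enqueue all copies, then wait once.
// Copies on the same stream execute in issue order, so the single
// synchronization covers every tensor enqueued before it.
static int upload_expert_batched(const TensorUpload *t, int n,
                                 cudaStream_t stream) {
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        CUDA_CHECK(cudaMemcpyAsync(t[i].dev_dst, t[i].host_src, t[i].bytes,
                                   cudaMemcpyHostToDevice, stream));
    }
    CUDA_CHECK(cudaStreamSynchronize(stream));
    return 1;
}
```

With `n = 3` (gate, up, down) the first function synchronizes three times and the second once, while the device buffers end up with the same contents in both cases.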

Properties

  • no API changes
  • no kernel changes
  • no cache policy changes
  • resident and compact cache layouts are unchanged
  • unsupported cases fall back to the existing path

Validation

  • A100 CUDA build: pass
  • smoke test: pass
  • deterministic output sanity check: identical visible output
  • local A100 80GB PCIe SSD-streaming tests showed a small repeatable throughput improvement on the tested workload

When a selected expert cache miss occurs in the direct-I/O CUDA path, gate/up/down tensors are currently uploaded independently, synchronizing the selected upload stream after each tensor.

Batch the three uploads for one expert and synchronize the upload stream once after all three have been enqueued.

Layouts, cache policy and kernels are unchanged. Unsupported cases transparently fall back to the existing path.