
Fix triton cross-entropy for large vocab sizes, support tensor-parallel #466

Draft

jlamypoirier wants to merge 5 commits into jlp_entropy_loss_tweaks from jlp_triton_loss

Conversation

@jlamypoirier (Collaborator) commented Jan 31, 2026

✨ Description

Add looped and tensor-parallel (TP) implementations of the Triton cross-entropy loss. It turns out the 64K vocab-size limitation is gone, but larger vocab sizes make the single kernel much slower, so the looped version is still better (above 32K, actually).
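For reference, a minimal PyTorch sketch of the "looped" idea: accumulate a streaming log-sum-exp over vocabulary chunks instead of materializing one huge softmax. This is an illustration only, not the Triton kernels in this PR; the function name and `chunk_size` are placeholders.

```python
import torch

def looped_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                         chunk_size: int = 32768) -> torch.Tensor:
    # Forward pass only. logits: (num_tokens, vocab_size), targets: (num_tokens,).
    num_tokens, vocab_size = logits.shape
    max_ = torch.full((num_tokens,), float("-inf"), device=logits.device)
    sum_exp = torch.zeros(num_tokens, device=logits.device, dtype=torch.float32)
    target_logit = torch.empty(num_tokens, device=logits.device, dtype=torch.float32)

    for start in range(0, vocab_size, chunk_size):
        chunk = logits[:, start:start + chunk_size].float()
        # Online log-sum-exp: rescale the running sum whenever the max grows.
        chunk_max = chunk.max(dim=1).values
        new_max = torch.maximum(max_, chunk_max)
        sum_exp = sum_exp * torch.exp(max_ - new_max) \
            + torch.exp(chunk - new_max.unsqueeze(1)).sum(dim=1)
        max_ = new_max
        # Pick up the target logit where the target index falls inside this chunk.
        in_chunk = (targets >= start) & (targets < start + chunk.shape[1])
        target_logit[in_chunk] = chunk[in_chunk, targets[in_chunk] - start]

    # loss_i = logsumexp(logits_i) - logits_i[target_i]
    return (max_ + sum_exp.log() - target_logit).mean()
```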

Test benchmark (8K tokens, CUDA time + estimated memory usage):

# Single GPU, vocab 10K
fused 0.348 ms 492.078 MB
triton 0.169 ms 163.873 MB

# Single GPU, vocab 100K
fused 4.241 ms 4915.233 MB
triton 1.709 ms 1638.433 MB

# 2 GPUs, vocab 10K
fused 1.108 ms 655.606 MB
triton 0.198 ms 82.084 MB

# 2 GPUs, vocab 100K
fused 9.569 ms 6553.846 MB
triton 0.996 ms 819.364 MB
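The tensor-parallel case follows the same pattern. A hedged sketch, assuming the logits are split along the vocab dimension so each rank holds one shard; the `torch.distributed` all-reduces below illustrate the reductions involved and are not the PR's actual kernels or API (names like `tp_cross_entropy` and `vocab_start` are made up here).

```python
import torch
import torch.distributed as dist

def tp_cross_entropy(local_logits: torch.Tensor, targets: torch.Tensor,
                     vocab_start: int, group=None) -> torch.Tensor:
    # Forward pass only. local_logits: (num_tokens, local_vocab),
    # targets: (num_tokens,) holding *global* vocab indices.
    local_logits = local_logits.float()
    vocab_end = vocab_start + local_logits.shape[1]

    # Global max over the full vocab: local max, then a max all-reduce.
    max_ = local_logits.max(dim=1).values
    dist.all_reduce(max_, op=dist.ReduceOp.MAX, group=group)

    # Sum of exponentials over the full vocab: local partial sums, then a sum all-reduce.
    sum_exp = torch.exp(local_logits - max_.unsqueeze(1)).sum(dim=1)
    dist.all_reduce(sum_exp, op=dist.ReduceOp.SUM, group=group)

    # Target logit: nonzero only on the rank that owns the target index.
    in_shard = (targets >= vocab_start) & (targets < vocab_end)
    target_logit = torch.zeros_like(max_)
    target_logit[in_shard] = local_logits[in_shard, targets[in_shard] - vocab_start]
    dist.all_reduce(target_logit, op=dist.ReduceOp.SUM, group=group)

    # loss_i = logsumexp(logits_i) - logits_i[target_i], identical on every rank.
    return (max_ + sum_exp.log() - target_logit).mean()
```

This avoids gathering the full logits on any rank, which is consistent with the memory drop seen in the 2-GPU benchmark above.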

@jlamypoirier changed the title from "Fix triton cross-entropy for large vocab sizes" to "Fix triton cross-entropy for large vocab sizes, support tensor-parallel" on Feb 2, 2026