[NV] Update DSR1 GB200 FP4 Disagg Submission#510

Merged
jthomson04 merged 41 commits into main from dsr1-fp4-gb200-dynamo-trt-260121
Jan 31, 2026

Conversation

@jthomson04
Collaborator

@jthomson04 jthomson04 commented Jan 21, 2026

This MR updates our dsr1-fp4-gb200-dynamo-trt submission. As a part of this MR, we also introduce a new way to launch Dynamo slurm jobs through srt-slurm. The new workflow for launching jobs is:

  1. Clone and install srt-slurm.
  2. Set cluster-specific information (partition, account, etc.).
  3. Use the config file named by the CONFIG_FILE env var in nvidia-master.yaml.
  4. Run srtctl apply -f $CONFIG_FILE
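
As a rough illustration of steps 3 and 4, the sketch below reads CONFIG_FILE from an environment mapping and assembles the `srtctl apply -f` invocation. Only the command line itself comes from the description above; the helper name `build_srtctl_command` and the example recipe path are hypothetical, and the command is only built, not executed.

```python
import os
import shlex


def build_srtctl_command(env=None):
    """Return the argv list for `srtctl apply -f $CONFIG_FILE`.

    `env` defaults to the process environment; passing a dict keeps the
    helper testable without touching real environment variables.
    """
    env = os.environ if env is None else env
    config_file = env.get("CONFIG_FILE", "")
    if not config_file:
        raise ValueError("CONFIG_FILE must be set before launching")
    return ["srtctl", "apply", "-f", config_file]


if __name__ == "__main__":
    # Hypothetical recipe path, used only for illustration.
    cmd = build_srtctl_command({"CONFIG_FILE": "recipes/example.yaml"})
    print(shlex.join(cmd))
```

Failing early on an unset CONFIG_FILE is a design choice of this sketch, so that a misconfigured job does not reach the scheduler.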

jthomson04 and others added 3 commits January 21, 2026 09:19
Signed-off-by: jthomson04 <jothomson@nvidia.com>
@jthomson04 jthomson04 changed the title Dsr1 fp4 gb200 dynamo trt 260121 [NV] Update DSR1 GB200 FP4 Disagg Submission Jan 21, 2026
Signed-off-by: jthomson04 <jothomson@nvidia.com>
@jthomson04 jthomson04 force-pushed the dsr1-fp4-gb200-dynamo-trt-260121 branch from a365a91 to 9cfacda Compare January 21, 2026 17:37
@jthomson04 jthomson04 marked this pull request as ready for review January 21, 2026 17:53
@jthomson04 jthomson04 requested a review from a team as a code owner January 21, 2026 17:53
@claude
Contributor

claude bot commented Jan 21, 2026

Claude finished @jthomson04's task in 1m 59s


PR Review Complete

  • Analyze PR context and changes
  • Check for master config changes and perf-changelog requirement
  • Review for bugs, logic errors, and blocking issues
  • Post summary

Summary: This PR updates the DSR1 GB200 FP4 Disagg submission and introduces a new workflow for launching Dynamo slurm jobs through srt-slurm. The changes look reasonable overall, but there are two issues to address:

🔴 BLOCKING: Master config file .github/configs/nvidia-master.yaml was modified but perf-changelog.yaml was not updated. When changing master config files, you must add a corresponding entry to perf-changelog.yaml documenting the changes.

🟡 WARNING: Hardcoded fp8 in srtctl tags at runners/launch_gb200-nv.sh:100 - should use ${PRECISION} variable instead since this is an FP4 submission.
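
A check of this kind could be approximated with a small lint pass. The sketch below scans script text for precision literals other than the expected one. The list of precision names, the function name, and the sample `TAGS=` lines are all assumptions for illustration; the actual contents of runners/launch_gb200-nv.sh are not shown in this thread.

```python
import re

# Precision names to look for; this list is an assumption for illustration.
KNOWN_PRECISIONS = ("fp4", "fp8", "bf16")


def find_hardcoded_precision(script_text, expected):
    """Return (line_number, line) pairs mentioning a precision other than `expected`."""
    others = [p for p in KNOWN_PRECISIONS if p != expected]
    # Lookarounds keep "fp8" from matching inside a longer alphanumeric token.
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + "|".join(map(re.escape, others)) + r")(?![A-Za-z0-9])"
    )
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical script lines; the real tag format may differ.
    sample = 'TAGS="gb200,fp8"\nTAGS="gb200,${PRECISION}"\n'
    for number, line in find_hardcoded_precision(sample, expected="fp4"):
        print(f"line {number}: {line}")
```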

Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com>
@functionstackx
Contributor

@claude what is the progress on https://github.com/InferenceMAX/InferenceMAX/actions/runs/21219580245

how many are complete, any failures so far? can u monitor every 5 minutes till all the jobs in 21219580245 are done?

@claude-code-infmax
Contributor

claude-code-infmax bot commented Jan 21, 2026

🔄 Workflow Monitoring - Status Update #12

Workflow Run: 21219580245
Time: 2026-01-21 ~23:15 UTC

📊 Current Progress

| Status | Count | Percentage |
| --- | --- | --- |
| ✅ Completed | 18 | 72% |
| 🔄 In Progress | 1 | 4% |
| ⏳ Queued | 6 | 24% |
| ❌ Failed | 0 | 0% |

Progress Bar: ████████████████░░░░░░ 72%
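
The percentages above follow from the 25 jobs in the run (18 + 1 + 6 + 0). A minimal sketch of that arithmetic, with a text progress bar, might look like the following; the function names and the default bar width are assumptions, not the bot's actual implementation.

```python
def summarize(counts):
    """Map each status to its rounded integer percentage of all jobs."""
    total = sum(counts.values())
    if total == 0:
        return {status: 0 for status in counts}
    return {status: round(100 * n / total) for status, n in counts.items()}


def progress_bar(done, total, width=20):
    """Render a bar of `width` cells, filled in proportion to done/total."""
    filled = round(width * done / total) if total else 0
    return "█" * filled + "░" * (width - filled)


if __name__ == "__main__":
    counts = {"completed": 18, "in_progress": 1, "queued": 6, "failed": 0}
    print(summarize(counts))
    print(progress_bar(counts["completed"], sum(counts.values())))
```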

🔄 Currently Running

| Job | Started | Duration |
| --- | --- | --- |
| dsr1_8k1k gb200 dynamo-trt fp4 specdecod-mtp | 22:43 UTC | ~32 min |

⏳ Queued Jobs (6 remaining)

  • dsr1_8k1k gb200 dynamo-trt fp4 specdecod-none (x3)
  • dsr1_1k1k gb200 dynamo-trt fp4 specdecod-none (x2)
  • dsr1_1k1k gb200 dynamo-trt fp4 specdecod-mtp (x1)

✅ Recently Completed

  • dsr1_1k1k gb200 dynamo-trt fp4 specdecod-none (22:28-22:42 UTC) ✅
  • dsr1_1k1k gb200 dynamo-trt fp4 specdecod-mtp (22:08-22:28 UTC) ✅
  • dsr1_8k1k gb200 dynamo-trt fp4 specdecod-none (21:23-22:08 UTC) ✅

Monitoring every 5 minutes until all jobs complete. Next check in 5 minutes.
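
A polling loop of this shape can be sketched as below. The status source is injected as a callable so the sketch stays independent of any particular API; `fetch_statuses`, the status strings, and `max_checks` are assumptions for illustration, and the 300-second default mirrors the 5-minute interval mentioned above.

```python
import time


def wait_for_jobs(fetch_statuses, interval_s=300, sleep=time.sleep, max_checks=None):
    """Poll until no job is queued or in progress, then return the final statuses.

    `fetch_statuses` is a caller-supplied callable returning a list of status
    strings such as "queued", "in_progress", or "completed".
    """
    checks = 0
    while True:
        statuses = fetch_statuses()
        checks += 1
        pending = [s for s in statuses if s in ("queued", "in_progress")]
        if not pending:
            return statuses
        if max_checks is not None and checks >= max_checks:
            raise TimeoutError(f"{len(pending)} job(s) still pending after {checks} checks")
        sleep(interval_s)
```

Passing `sleep` as a parameter lets a test substitute a no-op and run the loop instantly.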

@jthomson04
Collaborator Author

We've incorporated some updates to our SGL 1k/8k submission into this MR as well. See https://github.com/InferenceMAX/InferenceMAX/pull/513.

@functionstackx
Contributor

@pr-claude please review. r they missing perf-changelog.yaml

@claude
Contributor

claude bot commented Jan 22, 2026

Claude finished @functionstackx's task in 48s


PR Review

  • Check if perf-changelog.yaml was updated
  • Review changed files for issues
  • Post summary

Answer: Yes, perf-changelog.yaml is missing.

.github/configs/nvidia-master.yaml was modified in this PR but perf-changelog.yaml was not updated. This was flagged as a 🔴 BLOCKING issue in my previous review.

Files changed in this PR:

  1. .github/configs/nvidia-master.yaml (+170/-228)
  2. benchmarks/dsr1_fp4_gb200_dynamo-trt_slurm.sh (deleted)
  3. runners/launch_gb200-nv.sh (+122/-96)
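
The rule being applied to the file list above can be sketched as a simple predicate over changed paths. This is an illustration only: treating every `*-master.yaml` under `.github/configs/` as a master config, and expecting `perf-changelog.yaml` at the repository root, are assumptions.

```python
CHANGELOG = "perf-changelog.yaml"


def is_master_config(path):
    """Assumed convention: master configs are `.github/configs/*-master.yaml`."""
    return path.startswith(".github/configs/") and path.endswith("-master.yaml")


def changelog_missing(changed_files):
    """True when a master config changed but the changelog did not."""
    touched_master = any(is_master_config(p) for p in changed_files)
    return touched_master and CHANGELOG not in changed_files


if __name__ == "__main__":
    changed = [
        ".github/configs/nvidia-master.yaml",
        "benchmarks/dsr1_fp4_gb200_dynamo-trt_slurm.sh",
        "runners/launch_gb200-nv.sh",
    ]
    print(changelog_missing(changed))  # True: the changelog is not in the list
```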

Status of previous issues:

  • Fixed: The hardcoded fp8 in srtctl tags is now replaced with ${PRECISION} variable
  • Still open: perf-changelog.yaml needs to be added documenting the config changes

Required action:

Add an entry to perf-changelog.yaml documenting:


@functionstackx
Contributor

@pr-claude can u draft an inline diff within the github comment suggestion on what the perf-changelog.yaml change should be?

@claude
Contributor

claude bot commented Jan 22, 2026

Claude finished @functionstackx's task in 39s


Drafting perf-changelog.yaml Entry

  • Read current perf-changelog.yaml to understand format
  • Read nvidia-master.yaml changes to understand what was modified
  • Draft suggested perf-changelog.yaml entry

Suggested perf-changelog.yaml Entry

Based on the changes in this PR, here's the entry you should add to perf-changelog.yaml:

- config-keys:
    - dsr1-fp4-gb200-dynamo-trt
    - dsr1-fp4-gb200-dynamo-sglang
  description:
    - "Update Dynamo TRT image from 0.5.1-rc0.pre3 to 0.8.0"
    - "Update Dynamo SGLang image from v0.5.5.post2 to dev-cu13"
    - "Refactor configurations to use CONFIG_FILE-based recipes instead of inline parameter settings"
    - "Add new 1k8k sequence length configurations for SGLang"
    - "Introduce srt-slurm workflow for launching Dynamo jobs"
  pr-link: https://github.com/InferenceMAX/InferenceMAX/pull/510

Key changes documented:

  1. Image updates: TRT image 0.5.1-rc0.pre3 → 0.8.0, SGLang image v0.5.5.post2 → dev-cu13
  2. Config refactor: Moved from inline PREFILL_MAX_NUM_TOKENS, DECODE_MAX_BATCH_SIZE, etc. to external recipe YAML files via CONFIG_FILE
  3. New configs: Added 1k8k (1024x8192) sequence length configurations for dsr1-fp4-gb200-dynamo-sglang
  4. New workflow: srt-slurm integration for job launching (mentioned in PR description)
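
Before committing an entry like the one suggested above, its shape could be sanity-checked with a few lines of code. The sketch below works on an already-parsed entry (a plain dict) and assumes that `config-keys`, `description`, and `pr-link` are the required fields; that is inferred only from the suggested entry, and the real perf-changelog.yaml schema may differ.

```python
# Assumed required fields, inferred from the suggested entry above.
REQUIRED_FIELDS = ("config-keys", "description", "pr-link")


def validate_entry(entry):
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entry.get(field):
            problems.append(f"missing or empty field: {field}")
    for field in ("config-keys", "description"):
        value = entry.get(field)
        if value and not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{field} should be a list of strings")
    return problems


if __name__ == "__main__":
    entry = {
        "config-keys": ["dsr1-fp4-gb200-dynamo-trt", "dsr1-fp4-gb200-dynamo-sglang"],
        "description": [
            "Update Dynamo TRT image from 0.5.1-rc0.pre3 to 0.8.0",
            "Introduce srt-slurm workflow for launching Dynamo jobs",
        ],
        "pr-link": "https://github.com/InferenceMAX/InferenceMAX/pull/510",
    }
    print(validate_entry(entry))
```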

jthomson04 and others added 3 commits January 22, 2026 14:16
Signed-off-by: jthomson04 <jothomson@nvidia.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
@jthomson04
Collaborator Author

Signed-off-by: jthomson04 <jothomson@nvidia.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
@functionstackx
Contributor

hi @jthomson04

generally the trtllm parts of this PR are fine, the sglang parts of this PR still need a bit of work which @kedarpotdar-nv and we were discussing.

if u wanna merge this PR without delay, i suggest splitting out the trtllm parts of this away from the sglang parts

+viz @kedarpotdar-nv @cquil11

cquil11 and others added 4 commits January 29, 2026 16:11
Collaborator

@cquil11 cquil11 left a comment

ok. lgtm to me now. thank you

@cquil11
Collaborator

cquil11 commented Jan 29, 2026

@jthomson04 please feel free to merge at your convenience

Signed-off-by: jthomson04 <jothomson@nvidia.com>
@kedarpotdar-nv
Collaborator

Signed-off-by: jthomson04 <jothomson@nvidia.com>
@jthomson04
Collaborator Author

There were some missing 1k8k configs. Not ready to merge yet; kicked off a new pipeline

@functionstackx
Contributor

@jthomson04 1k8k takes too long. wanna split that out to another follow up PR & just merge 1k1k 8k1k first?

@cquil11
Collaborator

cquil11 commented Jan 30, 2026

@jthomson04 you can also just comment out the other sequence lengths in the master config to test. might be easier that way

@jthomson04
Collaborator Author

jthomson04 commented Jan 30, 2026

It's halfway done now. Will wait for that to complete before merge. https://github.com/InferenceMAX/InferenceMAX/actions/runs/21523871013

Contributor

@functionstackx functionstackx left a comment

LGTM. yolo! feel free to merge

@jthomson04 jthomson04 merged commit 4e9a376 into main Jan 31, 2026
28 of 66 checks passed
@jthomson04 jthomson04 deleted the dsr1-fp4-gb200-dynamo-trt-260121 branch January 31, 2026 04:24

7 participants