[NVIDIA] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 by ankursingh-nv · Pull Request #935 · SemiAnalysisAI/InferenceX

ankursingh-nv · 2026-03-23T21:48:48Z

Summary

Enable VLLM_USE_FLASHINFER_MOE_INT4=1 environment variable for the Kimi K2.5 INT4 B200 benchmark script (kimik2.5_int4_b200.sh)
Disable prefix caching (--no-enable-prefix-caching) for the vLLM serve command

Changes

benchmarks/single_node/kimik2.5_int4_b200.sh: Added export VLLM_USE_FLASHINFER_MOE_INT4=1 and --no-enable-prefix-caching flag
perf-changelog.yaml: Added changelog entry for kimik2.5-int4-b200-vllm config
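
As a rough illustration of the two settings, the sketch below assembles the environment and command line in Python rather than in the benchmark script itself. The helper name `build_launch` and the model identifier are hypothetical; only the environment variable and the `--no-enable-prefix-caching` flag come from this PR. Nothing is executed.

```python
import os


def build_launch(model: str) -> tuple[dict[str, str], list[str]]:
    """Return (env, argv) for a vLLM server launch with this PR's two settings."""
    env = dict(os.environ)
    # Env var added by this PR; per the review it controls kernel selection
    # for INT4 MoE operations.
    env["VLLM_USE_FLASHINFER_MOE_INT4"] = "1"
    # Prefix caching is switched off on the serve command line.
    argv = ["vllm", "serve", model, "--no-enable-prefix-caching"]
    return env, argv


# "example-org/example-model" is a placeholder, not the path used by the script.
env, argv = build_launch("example-org/example-model")
print(argv)
```

The pair could be handed to `subprocess.run(argv, env=env)` on a machine where vLLM is installed; the real script passes further arguments that are not shown in this thread.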

github-actions · 2026-03-23T21:48:59Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

benchmarks/single_node/kimik2.5_int4_b200.sh

claude

LGTM — simple one-line env var addition to the benchmark script.

Extended reasoning...

Overview

This PR touches two files: benchmarks/single_node/kimik2.5_int4_b200.sh (adds export VLLM_USE_FLASHINFER_MOE_INT4=1) and perf-changelog.yaml (adds a changelog entry for the change). The scope is minimal and self-contained.

Security risks

None. The change sets an environment variable that controls an internal vLLM kernel selection path for INT4 MoE operations. There are no auth, crypto, permissions, or injection concerns.

Level of scrutiny

Low. This is a benchmark tuning tweak — a single exported env var — following the same pattern used by other benchmark scripts in the repo (e.g., VLLM_MXFP4_USE_MARLIN=1 in GPT-OSS scripts). The changelog entry is boilerplate. The only finding is a nit about pr-link: TBD vs. the established URL convention, which has no runtime impact.

Other factors

No prior reviews on this PR from me. The timeline contains only the automated recipe-reminder bot comment. The change is mechanical and low-risk enough to approve despite the minor changelog style inconsistency flagged in the inline comment.

claude · 2026-03-23T22:00:14Z

perf-changelog.yaml

+    - kimik2.5-int4-b200-vllm
+  description:
+    - "Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/934


🟡 The new changelog entry uses pr-link: TBD instead of the full-URL convention used throughout this file; since the PR number is already known (#935), it should be set to https://github.com/SemiAnalysisAI/InferenceX/pull/935.

Extended reasoning...

What the bug is: The newly added changelog entry uses pr-link: TBD as a placeholder, which is inconsistent with the convention used by every other entry in the file. All other in-progress or placeholder entries use the full URL pattern https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (e.g., the entries for dsr1-fp8-h200-sglang, minimaxm2.5-fp8-h200-vllm, glm5-fp8-mi355x-sglang, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang).

The specific code path: The entry at the bottom of perf-changelog.yaml reads:
config-keys: kimik2.5-int4-b200-vllm
description: Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark
pr-link: TBD
Every other placeholder entry in the file uses pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, making this the sole outlier.

Why existing code does not prevent it: The Pydantic model defining the changelog schema declares pr_link as a plain str field with no URL format validation. This means TBD passes schema validation just as well as the XXX URL pattern, so there is no type-level enforcement of the naming convention.
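
One way such enforcement could look is sketched below, using only the standard library as a stand-in for the Pydantic model. The class name `ChangelogEntry`, its fields, and the regular expression are assumptions for illustration, not the repository's actual schema; the pattern accepts either a numbered PR URL or the `XXX` placeholder.

```python
import re
from dataclasses import dataclass

# Assumed convention: a full pull-request URL ending in a number or in XXX.
PR_LINK_RE = re.compile(
    r"https://github\.com/SemiAnalysisAI/InferenceX/pull/(\d+|XXX)"
)


@dataclass
class ChangelogEntry:
    """Hypothetical stand-in for the changelog schema model."""

    config_keys: list[str]
    description: list[str]
    pr_link: str

    def __post_init__(self) -> None:
        # Reject anything that is not a PR URL, such as the bare string "TBD".
        if PR_LINK_RE.fullmatch(self.pr_link) is None:
            raise ValueError(f"pr-link is not a PR URL: {self.pr_link!r}")


def is_valid(pr_link: str) -> bool:
    """Return True if an entry with this pr-link passes the check."""
    try:
        ChangelogEntry(["kimik2.5-int4-b200-vllm"], ["example"], pr_link)
    except ValueError:
        return False
    return True


print(is_valid("TBD"))
```

In a Pydantic model the same check could be expressed as a field validator on `pr_link`; whether that fits the existing schema is not something this thread establishes.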

Impact: This is a documentation/style inconsistency only with no runtime impact. However, any tooling that parses pr-link values as URLs (link aggregators, changelog renderers, or automation scripts) would fail or produce unexpected output for this one entry while handling all other entries correctly.

How to fix it: Since this is PR #935, the correct fix is pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/935. At minimum, it should follow the existing XXX convention: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.

Step-by-step proof:

Searching perf-changelog.yaml for placeholder pr-links matching pr-link.*XXX returns approximately 5 entries, all using the full URL format with XXX as the placeholder.

Searching for pr-link: TBD returns exactly one entry -- the new one added in this PR.

The PR number (935) is already known from the PR metadata, so the actual URL https://github.com/SemiAnalysisAI/InferenceX/pull/935 is available and should be used directly.
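
The two searches above can be reproduced with a short script. The sample text is made up for illustration and is not the repository's perf-changelog.yaml; the function name `classify_pr_links` is likewise hypothetical.

```python
import re
from collections import Counter

# Capture the value of every "pr-link:" line, ignoring surrounding spaces.
PR_LINK_LINE = re.compile(r"^[ \t]*pr-link:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def classify_pr_links(text: str) -> Counter:
    """Count pr-link values by style: numbered URL, XXX placeholder, or other."""
    counts: Counter = Counter()
    for value in PR_LINK_LINE.findall(text):
        if re.fullmatch(r"https://\S+/pull/\d+", value):
            counts["url"] += 1
        elif re.fullmatch(r"https://\S+/pull/XXX", value):
            counts["placeholder"] += 1
        else:
            counts["other"] += 1
    return counts


# Illustrative sample only.
sample = """\
- config-keys:
    - example-config-a
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
- config-keys:
    - example-config-b
  pr-link: TBD
"""
print(classify_pr_links(sample))
```

Any value landing in the "other" bucket, such as TBD, is an outlier relative to the URL convention.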

cquil11 · 2026-03-24T20:02:21Z

@claude add pr description based on perf-changelog additions

Klaud-Cold · 2026-03-24T20:02:43Z

Claude finished @cquil11's task in 35s

Adding PR Description

Gather context from changed files
Update PR description

Updated the PR description based on the perf-changelog additions and benchmark script changes.

Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200

31e6065

ankursingh-nv requested a review from a team March 23, 2026 21:48

github-project-automation bot added this to InferenceMAX Board Mar 23, 2026

ankursingh-nv added NVIDIA sweep-enabled labels Mar 23, 2026

Update perf-changelog.yaml

1d0ae45

ankursingh-nv changed the title ~~Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200~~ [WIP] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 Mar 23, 2026

functionstackx reviewed Mar 23, 2026

benchmarks/single_node/kimik2.5_int4_b200.sh

claude bot reviewed Mar 23, 2026

Ankur-singh and others added 3 commits March 23, 2026 22:10

Merge remote-tracking branch 'origin/main' into ankur/kimi-int4-b200-…

e163cdf

…flashinfer-moe # Conflicts: # perf-changelog.yaml

diable prefix caching

9be5863

Update perf-changelog.yaml

ca02b57

cquil11 approved these changes Mar 24, 2026

cquil11 changed the title ~~[WIP] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200~~ [NVIDIA] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 Mar 24, 2026

ankursingh-nv merged commit 79ea365 into main Mar 24, 2026
9 of 25 checks passed

ankursingh-nv deleted the ankur/kimi-int4-b200-flashinfer-moe branch March 24, 2026 20:03

github-project-automation bot moved this to Done in InferenceMAX Board Mar 24, 2026

ankursingh-nv merged 5 commits into main from ankur/kimi-int4-b200-flashinfer-moe

5 participants
