[NVIDIA] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200#935

Merged
ankursingh-nv merged 5 commits into main from ankur/kimi-int4-b200-flashinfer-moe on Mar 24, 2026

Conversation

@ankursingh-nv
Contributor

@ankursingh-nv ankursingh-nv commented Mar 23, 2026

Summary

  • Enable VLLM_USE_FLASHINFER_MOE_INT4=1 environment variable for the Kimi K2.5 INT4 B200 benchmark script (kimik2.5_int4_b200.sh)
  • Disable prefix caching (--no-enable-prefix-caching) for the vLLM serve command

Changes

  • benchmarks/single_node/kimik2.5_int4_b200.sh: Added export VLLM_USE_FLASHINFER_MOE_INT4=1 and --no-enable-prefix-caching flag
  • perf-changelog.yaml: Added changelog entry for kimik2.5-int4-b200-vllm config
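
As a rough illustration of what the two script changes amount to, the sketch below builds the environment and argument list for a server launch. It is written in Python for illustration only; the real script is shell, and its other flags are not shown in this conversation. The helper name `build_launch` and the model placeholder are hypothetical.

```python
import os
from typing import Dict, List, Optional, Tuple


def build_launch(
    model: str, base_env: Optional[Dict[str, str]] = None
) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) for a vLLM server launch with both PR changes applied."""
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    # From the PR: opt in to the FlashInfer INT4 MoE path.
    env["VLLM_USE_FLASHINFER_MOE_INT4"] = "1"
    # From the PR: disable prefix caching on the serve command.
    # Any other flags used by the real script are omitted (not shown in the PR text).
    argv = ["vllm", "serve", model, "--no-enable-prefix-caching"]
    return env, argv


# "MODEL_PLACEHOLDER" is a stand-in, not the model path used by the benchmark.
env, argv = build_launch("MODEL_PLACEHOLDER", base_env={})
print(env["VLLM_USE_FLASHINFER_MOE_INT4"], " ".join(argv))
```

The returned pair could be handed to `subprocess.Popen(argv, env=env)`; keeping the construction separate from the launch makes the two changes easy to inspect.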

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

@ankursingh-nv ankursingh-nv changed the title from "Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200" to "[WIP] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200" on Mar 23, 2026
Contributor

@claude claude bot left a comment

LGTM — simple one-line env var addition to the benchmark script.

Extended reasoning...

Overview

This PR touches two files: benchmarks/single_node/kimik2.5_int4_b200.sh (adds export VLLM_USE_FLASHINFER_MOE_INT4=1) and perf-changelog.yaml (adds a changelog entry for the change). The scope is minimal and self-contained.

Security risks

None. The change sets an environment variable that controls an internal vLLM kernel selection path for INT4 MoE operations. There are no auth, crypto, permissions, or injection concerns.

Level of scrutiny

Low. This is a benchmark tuning tweak — a single exported env var — following the same pattern used by other benchmark scripts in the repo (e.g., VLLM_MXFP4_USE_MARLIN=1 in GPT-OSS scripts). The changelog entry is boilerplate. The only finding is a nit about pr-link: TBD vs. the established URL convention, which has no runtime impact.

Other factors

No prior reviews on this PR from me. The timeline contains only the automated recipe-reminder bot comment. The change is mechanical and low-risk enough to approve despite the minor changelog style inconsistency flagged in the inline comment.

    - kimik2.5-int4-b200-vllm
  description:
    - "Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/934

🟡 The new changelog entry uses pr-link: TBD instead of the placeholder convention used throughout this file. Since the PR number is already known (#935), pr-link should be set to https://github.com/SemiAnalysisAI/InferenceX/pull/935.

Extended reasoning...

What the bug is: The newly added changelog entry uses pr-link: TBD as a placeholder, which is inconsistent with the convention used by every other entry in the file. All other in-progress or placeholder entries use the full URL pattern https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (e.g., the entries for dsr1-fp8-h200-sglang, minimaxm2.5-fp8-h200-vllm, glm5-fp8-mi355x-sglang, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang).

The specific code path: The entry at the bottom of perf-changelog.yaml reads:
config-keys: kimik2.5-int4-b200-vllm
description: Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark
pr-link: TBD
Every other placeholder entry in the file uses pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, making this the sole outlier.

Why existing code does not prevent it: The Pydantic model defining the changelog schema declares pr_link as a plain str field with no URL format validation. This means TBD passes schema validation just as well as the XXX URL pattern, so there is no type-level enforcement of the naming convention.
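
A minimal sketch of the kind of check that would close this gap is shown below. The repository's actual Pydantic model is not shown in this thread, so this uses only the standard library; the function name `check_pr_link` and the exact pattern are assumptions, based on the URL convention quoted above. The same logic could live inside a field validator on the real model.

```python
import re

# Assumed convention: a full pull-request URL in this repository, ending either
# in a PR number or in the literal placeholder "XXX".
PR_LINK_RE = re.compile(
    r"https://github\.com/SemiAnalysisAI/InferenceX/pull/(\d+|XXX)"
)


def check_pr_link(value: str) -> str:
    """Return value unchanged if it follows the convention, else raise ValueError."""
    # fullmatch rejects partial matches and trailing junk such as a newline.
    if PR_LINK_RE.fullmatch(value) is None:
        raise ValueError(f"pr-link does not follow the URL convention: {value!r}")
    return value


for candidate in (
    "https://github.com/SemiAnalysisAI/InferenceX/pull/935",
    "https://github.com/SemiAnalysisAI/InferenceX/pull/XXX",
    "TBD",
):
    try:
        check_pr_link(candidate)
        print("ok      ", candidate)
    except ValueError as exc:
        print("rejected", exc)
```

With a check like this, TBD would be rejected at validation time while both the numbered URL and the XXX placeholder would still pass.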

Impact: This is a documentation/style inconsistency only with no runtime impact. However, any tooling that parses pr-link values as URLs (link aggregators, changelog renderers, or automation scripts) would fail or produce unexpected output for this one entry while handling all other entries correctly.

How to fix it: Since this is PR #935, the correct fix is pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/935. At minimum, it should follow the existing XXX convention: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.

Step-by-step proof:

  1. Searching perf-changelog.yaml for placeholder pr-links matching pr-link.*XXX returns approximately 5 entries, all using the full URL format with XXX as the placeholder.
  2. Searching for pr-link: TBD returns exactly one entry -- the new one added in this PR.
  3. The PR number (935) is already known from the PR metadata, so the actual URL https://github.com/SemiAnalysisAI/InferenceX/pull/935 is available and should be used directly.
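
The searches in steps 1 and 2 can be approximated with a short script. This is a sketch under stated assumptions: the sample text is made up for illustration and is not the contents of perf-changelog.yaml, and the function name `classify_pr_links` is hypothetical.

```python
import re
from typing import Dict, List

# Matches one "pr-link: <value>" line; [ \t]* keeps the match on a single line.
PR_LINK_LINE = re.compile(r"^[ \t]*pr-link:[ \t]*(.*)$", flags=re.MULTILINE)
NUMBERED_URL = re.compile(r"https://github\.com/.+/pull/\d+")


def classify_pr_links(changelog_text: str) -> Dict[str, List[str]]:
    """Group pr-link values into numbered URLs, XXX placeholders, and anything else."""
    buckets: Dict[str, List[str]] = {"url": [], "placeholder": [], "other": []}
    for match in PR_LINK_LINE.finditer(changelog_text):
        value = match.group(1).strip()
        if value.endswith("/pull/XXX"):
            buckets["placeholder"].append(value)
        elif NUMBERED_URL.fullmatch(value):
            buckets["url"].append(value)
        else:
            buckets["other"].append(value)
    return buckets


# Made-up sample entries, for illustration only.
SAMPLE = """\
- config-keys: example-a
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
- config-keys: example-b
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/934
- config-keys: example-c
  pr-link: TBD
"""

result = classify_pr_links(SAMPLE)
print({kind: len(values) for kind, values in result.items()})
```

Any value landing in the "other" bucket is an outlier in the sense described above; on this sample, that is the single TBD entry.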

@cquil11
Collaborator

cquil11 commented Mar 24, 2026

@claude add pr description based on perf-changelog additions

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 24, 2026

Claude finished @cquil11's task in 35s

Adding PR Description

  • Gather context from changed files
  • Update PR description

Updated the PR description based on the perf-changelog additions and benchmark script changes.

@cquil11 cquil11 changed the title from "[WIP] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200" to "[NVIDIA] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200" on Mar 24, 2026
@ankursingh-nv ankursingh-nv merged commit 79ea365 into main Mar 24, 2026
9 of 25 checks passed
@ankursingh-nv ankursingh-nv deleted the ankur/kimi-int4-b200-flashinfer-moe branch March 24, 2026 20:03
5 participants