[NVIDIA] Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200#935
ankursingh-nv merged 5 commits into main
Thanks for the contribution! For vLLM and SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
LGTM — simple one-line env var addition to the benchmark script.
Extended reasoning...
Overview
This PR touches two files: benchmarks/single_node/kimik2.5_int4_b200.sh (adds export VLLM_USE_FLASHINFER_MOE_INT4=1) and perf-changelog.yaml (adds a changelog entry for the change). The scope is minimal and self-contained.
Security risks
None. The change sets an environment variable that controls an internal vLLM kernel selection path for INT4 MoE operations. There are no auth, crypto, permissions, or injection concerns.
Level of scrutiny
Low. This is a benchmark tuning tweak — a single exported env var — following the same pattern used by other benchmark scripts in the repo (e.g., VLLM_MXFP4_USE_MARLIN=1 in GPT-OSS scripts). The changelog entry is boilerplate. The only finding is a nit about pr-link: TBD vs. the established URL convention, which has no runtime impact.
Other factors
No prior reviews on this PR from me. The timeline contains only the automated recipe-reminder bot comment. The change is mechanical and low-risk enough to approve despite the minor changelog style inconsistency flagged in the inline comment.
perf-changelog.yaml
Outdated
    - kimik2.5-int4-b200-vllm
  description:
    - "Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/934
🟡 The new changelog entry uses pr-link: TBD instead of the established placeholder convention used throughout this file; since the PR number is already known (#935), this should be set to https://github.com/SemiAnalysisAI/InferenceX/pull/935.
Extended reasoning...
What the bug is: The newly added changelog entry uses pr-link: TBD as a placeholder, which is inconsistent with the convention used by every other entry in the file. All other in-progress or placeholder entries use the full URL pattern https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (e.g., the entries for dsr1-fp8-h200-sglang, minimaxm2.5-fp8-h200-vllm, glm5-fp8-mi355x-sglang, qwen3.5-bf16-mi325x-sglang, qwen3.5-fp8-mi325x-sglang).
The specific code path: The entry at the bottom of perf-changelog.yaml reads:
config-keys: kimik2.5-int4-b200-vllm
description: Enable VLLM_USE_FLASHINFER_MOE_INT4=1 for Kimi K2.5 INT4 B200 benchmark
pr-link: TBD
Every other placeholder entry in the file uses pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, making this the sole outlier.
Why existing code does not prevent it: The Pydantic model defining the changelog schema declares pr_link as a plain str field with no URL format validation. This means TBD passes schema validation just as well as the XXX URL pattern, so there is no type-level enforcement of the naming convention.
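A format check on the field would close this gap. The following is a minimal, dependency-free sketch: the real schema is a Pydantic model that is not shown in this PR, so the ChangelogEntry dataclass, its field names, and PR_LINK_RE are hypothetical stand-ins, and the accepted pattern (a numeric PR ID or the literal XXX placeholder) is an assumption based on the convention described above.

```python
import re
from dataclasses import dataclass

# Assumed convention: the full PR URL, ending in a numeric ID or the
# literal XXX placeholder used by in-progress entries.
PR_LINK_RE = re.compile(
    r"https://github\.com/SemiAnalysisAI/InferenceX/pull/(\d+|XXX)"
)


@dataclass
class ChangelogEntry:
    """Hypothetical stand-in for the repo's changelog schema model."""

    config_keys: list[str]
    description: list[str]
    pr_link: str

    def __post_init__(self) -> None:
        # Reject anything that is not a PR URL, e.g. the bare string "TBD".
        if not PR_LINK_RE.fullmatch(self.pr_link):
            raise ValueError(
                f"pr-link does not follow the URL convention: {self.pr_link!r}"
            )
```

With a check like this, constructing an entry with pr_link="TBD" would raise a ValueError, while both the XXX placeholder URL and a concrete PR URL would be accepted.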
Impact: This is a documentation/style inconsistency only with no runtime impact. However, any tooling that parses pr-link values as URLs (link aggregators, changelog renderers, or automation scripts) would fail or produce unexpected output for this one entry while handling all other entries correctly.
How to fix it: Since this is PR #935, the correct fix is: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/935. At minimum it should follow the existing XXX convention: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
Step-by-step proof:
- Searching perf-changelog.yaml for placeholder pr-links matching pr-link.*XXX returns approximately 5 entries, all using the full URL format with XXX as the placeholder.
- Searching for pr-link: TBD returns exactly one entry -- the new one added in this PR.
- The PR number (935) is already known from the PR metadata, so the actual URL https://github.com/SemiAnalysisAI/InferenceX/pull/935 is available and should be used directly.
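The two searches in the proof above can be reproduced with a short script. This is only a sketch: count_placeholders is a hypothetical helper, and it assumes each pr-link sits on its own line in perf-changelog.yaml, as in the entry quoted earlier.

```python
import re


def count_placeholders(changelog_text: str) -> dict[str, int]:
    """Count pr-link lines that use the XXX URL placeholder vs. a bare TBD."""
    # Lines such as: pr-link: https://github.com/<org>/<repo>/pull/XXX
    xxx = re.findall(
        r"^[ \t]*pr-link:[ \t]*\S+/pull/XXX[ \t]*$",
        changelog_text,
        flags=re.MULTILINE,
    )
    # Lines such as: pr-link: TBD
    tbd = re.findall(
        r"^[ \t]*pr-link:[ \t]*TBD[ \t]*$",
        changelog_text,
        flags=re.MULTILINE,
    )
    return {"xxx_url": len(xxx), "tbd": len(tbd)}
```

Calling it on the file contents, for example count_placeholders(open("perf-changelog.yaml").read()), would show how many entries use each placeholder style.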
…flashinfer-moe # Conflicts: # perf-changelog.yaml
@claude add pr description based on perf-changelog additions
Summary
- Enable the VLLM_USE_FLASHINFER_MOE_INT4=1 environment variable for the Kimi K2.5 INT4 B200 benchmark script (kimik2.5_int4_b200.sh)
- Disable prefix caching (--no-enable-prefix-caching) for the vLLM serve command
Changes
- benchmarks/single_node/kimik2.5_int4_b200.sh: Added export VLLM_USE_FLASHINFER_MOE_INT4=1 and the --no-enable-prefix-caching flag
- perf-changelog.yaml: Added changelog entry for the kimik2.5-int4-b200-vllm config