Optimize Kimi-K2.5-MXFP4 on MI355X: Enable AITER, Expert Parallel, and update to vLLM v0.18.0#936

Open

ChuanLi1101 wants to merge 7 commits intoSemiAnalysisAI:mainfrom

ChuanLi1101:chuan/kimik25_mxfp4_mi355x_vllm

Contributor

ChuanLi1101 commented Mar 24, 2026

Summary

Supersedes #922 with review feedback addressed.

Enable AITER acceleration for Kimi-K2.5-FP4 on MI355X (VLLM_ROCM_USE_AITER=1, VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4)
Add expert parallel support with --enable-expert-parallel when EP_SIZE > 1, and add corresponding tp: 4, ep: 4 search-space entries
Add TP=4 search spaces across all benchmark configurations (ISL/OSL combos)
Update vLLM image from v0.16.0 to v0.18.0 for extra perf gains (addresses @seungrokj's review comment)
Add --no-enable-prefix-caching flag (addresses @seungrokj's review comment)
Tune serving parameters: lower --gpu-memory-utilization from 0.95 to 0.90 for stability, change --block-size from 64 to 1
Add MEC firmware version check to conditionally set HSA_NO_SCRATCH_RECLAIM=1 on older firmware (< 177) to avoid crashes

Changes from original PR #922

Cleaned up redundant AITER env vars (VLLM_ROCM_USE_AITER=1 enables all sub-modules automatically)
Switched VLLM_ROCM_QUICK_REDUCE_QUANTIZATION from INT8 to INT4 to match MXFP4 quantization
Added --no-enable-prefix-caching per reviewer request
Updated image to vllm/vllm-openai-rocm:v0.18.0 per reviewer request

Changed Files

File	Description
`.github/configs/amd-master.yaml`	Update image to v0.18.0, add tp=4 and tp=4/ep=4 search spaces
`benchmarks/single_node/kimik2.5_fp4_mi355x.sh`	Enable AITER, add expert parallel, MEC FW check, tune vLLM args

Test Plan

e2e Test running: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23448132756 (all completed jobs passing)

lxgsbqylbk and others added 5 commits

March 23, 2026 16:25


          optimize kimi-k2.5-fp4 on amd mi355x gpu

87f3c74


          add expert parallel for kimik2.5-fp4-mi355x

d26c7c9


          Clean up AITER env vars, switch quickreduce to INT4, add TP8 search s…

26b127b

…pace

- Remove redundant VLLM_ROCM_USE_AITER_MLA=1 and VLLM_ROCM_USE_AITER_MOE=1
  (both default to True in vllm envs.py, only master switch needed)
- Remove VLLM_ROCM_USE_AITER_TRITON_ROPE=1 (noop without
  --compilation-config custom_ops+=+rotary_embedding)
- Switch VLLM_ROCM_QUICK_REDUCE_QUANTIZATION from INT8 to INT4
  for better TTFT/TPOT (2.2x vs 1.17x per quickreduce benchmarks)
- Add TP8EP1 back to all search spaces alongside TP4EP1 and TP4EP4
  so InferenceX can sweep and determine optimal config empirically

Made-with: Cursor


          Update kimi2.5-fp4-mi355x-vll image from v0.16.0 to v.18.0

8e2adec

Apply `--no-enable-prefix-caching`


          Fix the image mistake

4dfed36

ChuanLi1101 requested a review from a team

March 24, 2026 08:56

ChuanLi1101 requested review from billishyahao and chunfangamd as code owners

March 24, 2026 08:56

github-project-automation bot added this to InferenceMAX Board

ChuanLi1101 mentioned this pull request

Optimize Kimi-K2.5-FP4 on AMD MI355X: Enable AITER and Expert Parallel #922

Closed

5 tasks


          Merge branch 'main' into chuan/kimik25_mxfp4_mi355x_vllm

a720974

chunfangamd approved these changes

View reviewed changes

Collaborator

chunfangamd left a comment

LGTM

chunfangamd enabled auto-merge (squash)

March 24, 2026 09:05

claude bot reviewed

View reviewed changes

benchmarks/single_node/kimik2.5_fp4_mi355x.sh

Comment on lines 59 to 68

    
              set -x

              vllm serve $MODEL --port $PORT \

              --tensor-parallel-size=$TP \

              --gpu-memory-utilization 0.95 \

              $EP \

              --gpu-memory-utilization 0.90 \

              --max-model-len $MAX_MODEL_LEN \

              --block-size=64 \

              --disable-log-requests \

              --block-size=1 \

              --no-enable-prefix-caching \

              --trust-remote-code \

              --mm-encoder-tp-mode data > $SERVER_LOG 2>&1 &

Contributor

claude bot Mar 24, 2026

🔴 --disable-log-requests was present in the previous version of this script but is missing from the updated vllm serve command, likely dropped accidentally during the multi-flag edit that swapped --block-size=64 and --disable-log-requests for --block-size=1 and --no-enable-prefix-caching. Without it, vLLM logs every incoming request to $SERVER_LOG, adding file I/O overhead during high-concurrency benchmark runs (up to conc=64, 640 prompts) that can inflate latency measurements and skew throughput numbers. Consider adding --disable-log-requests back to the vllm serve invocation.

Extended reasoning...

What the bug is: The vllm serve command in kimik2.5_fp4_mi355x.sh (lines 59-68) is missing the --disable-log-requests flag. Without this flag, vLLM logs metadata for every incoming request, which in this script is redirected to $SERVER_LOG via > $SERVER_LOG 2>&1.

How the removal happened: The diff shows the original script had both --block-size=64 and --disable-log-requests on adjacent lines. The PR replaced --block-size=64 with --block-size=1 and added --no-enable-prefix-caching, but --disable-log-requests was silently dropped in the process. The PR description explicitly calls out adding --no-enable-prefix-caching and changing --block-size, but makes no mention of intentionally removing --disable-log-requests, which is strong evidence the removal was accidental.

Why existing code does not prevent it: There is no shared configuration or base function that enforces --disable-log-requests; each script independently assembles its vllm serve flags. Omissions must be caught by comparing siblings during review.

Impact: At the benchmark's maximum concurrency (conc=64, generating CONC * 10 = 640 prompts per run), vLLM produces one log line per request. Even though these go to a file rather than stdout, writing them during the measurement window adds disk I/O that competes with the benchmark workload, potentially inflating latency percentiles and reducing apparent throughput. The effect is more pronounced on ISL=8192 runs where per-request processing takes longer and log entries are more detailed.

Step-by-step proof: (1) Before this PR, kimik2.5_fp4_mi355x.sh contained --disable-log-requests — visible as a - deletion line in the diff. (2) The PR rewrites the flag block; --disable-log-requests does not appear in any + line. (3) A repo-wide grep confirms 11 other vLLM benchmark scripts consistently include this flag: kimik2.5_int4_mi355x.sh:38, kimik2.5_int4_mi325x.sh:41, minimaxm2.5_fp8_mi300x.sh:40, minimaxm2.5_fp8_mi325x.sh:42, minimaxm2.5_fp8_h200.sh:40, minimaxm2.5_fp8_h100.sh:45, gptoss_fp4_b200.sh:57, gptoss_fp4_h100.sh:41, gptoss_fp4_h200.sh:54, kimik2.5_int4_b200.sh:41, kimik2.5_int4_h200.sh:43. (4) With conc=64, 640 requests fire per run, each generating a log entry in $SERVER_LOG, adding file-write pressure during the measurement window.

Addressing the refutation: One verifier argues the removal may be intentional, citing that minimaxm2.5_fp8_mi355x.sh (also vLLM v0.18.0 on MI355X) also lacks --disable-log-requests, suggesting an emergent MI355X convention. This observation is factually correct — that script indeed omits the flag. However, the critical distinction is that kimik2.5_fp4_mi355x.sh previously had the flag and the diff shows it was actively removed, whereas minimaxm2.5_fp8_mi355x.sh appears to have never included it (likely its own inadvertent omission from the start). The PR author explicitly documenting every other change but not this removal, combined with the adjacent-line edit pattern, strongly favors accidental omission over intentional design. The refuter's point that logs go to a file rather than stdout is valid and reduces severity compared to unbuffered console output, but does not eliminate the I/O overhead concern under high-concurrency workloads.

Fix: Add --disable-log-requests back to the vllm serve call to match the 11 other vLLM benchmark scripts in the repository.

benchmarks/single_node/kimik2.5_fp4_mi355x.sh Outdated

Comment on lines +47 to +51

+              if [ "$EP_SIZE" -gt 1 ]; then
+                EP=" --enable-expert-parallel"
+              else
+                EP=" "
+              fi

Contributor

claude bot Mar 24, 2026

🟡 The new EP_SIZE check at line 47 uses single-bracket [ "$EP_SIZE" -gt 1 ], which emits [: : integer expression expected to stderr whenever EP_SIZE is empty or unset; fix with [ "${EP_SIZE:-0}" -gt 1 ] or [[ ${EP_SIZE:-0} -gt 1 ]].

Extended reasoning...

What the bug is: At kimik2.5_fp4_mi355x.sh:47, the newly added expert-parallel check uses POSIX single-bracket syntax. POSIX [ ] requires both operands of -gt to be non-empty integers; when EP_SIZE is empty or unset, bash prints '[: : integer expression expected' to stderr and exits the condition with a non-zero status, causing the else branch (no expert parallel) to be taken.

The specific code path: The new YAML search-space entries include configs without an ep field (e.g., { tp: 8, conc-start: 4, conc-end: 64 } and { tp: 4, conc-start: 4, conc-end: 64 }). For those runs, EP_SIZE is either not exported to the shell environment or is exported as empty. The check on line 47 fires and produces the stderr noise on every such invocation.

Why existing code does not prevent it: EP_SIZE is intentionally absent from the check_env_vars call at the top of the script, because it is genuinely optional - only the { tp: 4, ep: 4, ... } entry needs it. There is no set -e in the script, so execution continues to the else branch correctly, but the spurious diagnostic is still emitted.

Addressing the refutation: One verifier argues the CI sweep generator (generate_sweep_configs.py line 677) always supplies EP_SIZE=1 as a default when ep is absent, meaning the error never occurs in automated CI runs. This is plausible - if that default is reliably enforced by the pipeline, the bash error is never triggered in practice. However, the script is also invocable manually for local testing or ad-hoc runs where EP_SIZE may not be set, and the defensive pattern costs nothing to adopt.

Step-by-step proof (manual invocation):

Set only the required env vars: MODEL=... TP=8 CONC=4 ISL=1024 OSL=1024 MAX_MODEL_LEN=... etc.
Leave EP_SIZE unset.
Run: bash kimik2.5_fp4_mi355x.sh
Bash reaches line 47: [ "" -gt 1 ] emits 'bash: [: : integer expression expected' to stderr.
Execution continues with EP=" " (no expert parallel) - correct behavior, but the spurious error was still emitted.

How to fix: Replace line 47 with: if [ "${EP_SIZE:-0}" -gt 1 ] (default to 0 when unset) or equivalently: if (( ${EP_SIZE:-0} > 1 )). Both handle empty or unset EP_SIZE silently without changing any observable behavior.

benchmarks/single_node/kimik2.5_fp4_mi355x.sh

Comment on lines +39 to +42

+              version=`rocm-smi --showfw | grep MEC | head -n 1 |  awk '{print $NF}'`
+              if [[ "$version" == "" || $version -lt 177 ]]; then
+                export HSA_NO_SCRATCH_RECLAIM=1
+              fi

Contributor

claude bot Mar 24, 2026

🟡 The MEC firmware version check at line 40 fails to set HSA_NO_SCRATCH_RECLAIM=1 when rocm-smi returns a non-numeric string (e.g., N/A): the empty-string guard is false, and bash arithmetic comparison -lt on a non-integer operand inside [[ ]] returns false, so the short-circuit OR is false overall and the safety flag is never set. Since HSA_NO_SCRATCH_RECLAIM=1 prevents crashes on older firmware, the safe default for an indeterminate version should be to set it; fix by adding a regex guard: [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]].

Extended reasoning...

Bug Analysis

The newly added MEC firmware version check at benchmarks/single_node/kimik2.5_fp4_mi355x.sh:39-42 reads:

version=$(rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}')
if [[ "$version" == "" || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

How the Bug Manifests

The condition uses short-circuit OR with two branches: "$version" == "" handles the empty case, and $version -lt 177 handles the numeric comparison. A third case is unhandled: when version is a non-empty, non-numeric string such as N/A or unknown — common sentinels in ROCm tooling.

Inside bash [[ ]], the -lt operator performs arithmetic comparison. When the left-hand operand is a non-integer string, bash either silently coerces it to 0 or produces an error and returns false. The verifiers confirmed that for non-integer strings the comparison returns false in practice.

The Specific Code Path

rocm-smi --showfw outputs a firmware table; grep MEC | head -n 1 | awk '{print $NF}' extracts the last field of the first MEC line.
If that field is N/A, version is set to N/A.
[[ "N/A" == "" ]] evaluates to false (not empty).
[[ N/A -lt 177 ]] — bash arithmetic on a non-integer returns false.
Short-circuit OR: false || false → false — the if body is skipped.
HSA_NO_SCRATCH_RECLAIM is never exported.

Why Existing Code Does Not Prevent It

The empty-string guard correctly covers the case where awk produces no output. It does not cover non-empty, non-numeric values. There is no type check or regex validation before using -lt on $version.

Impact

On systems where rocm-smi --showfw returns a non-numeric firmware version field, HSA_NO_SCRATCH_RECLAIM=1 will silently not be set. Per the script comment, this flag is needed on MEC firmware older than version 177 to prevent RCCL scratch-memory reclaim crashes. A machine with old firmware but a non-numeric version string in rocm-smi output would be left in the crash-prone configuration without any warning. The practical risk is low since rocm-smi output is normally numeric on production hardware, but the violated fail-safe assumption is a real defect.

Step-by-Step Proof

rocm-smi --showfw on a specific machine outputs a MEC line whose last field is N/A.
version is assigned the string N/A.
[[ "N/A" == "" ]] evaluates to false.
[[ N/A -lt 177 ]] — bash arithmetic comparison with non-integer — evaluates to false.
Entire OR condition is false; if body is not entered.
HSA_NO_SCRATCH_RECLAIM remains unset, leaving potential for crash on older firmware.

Recommended Fix

if [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

The added ! "$version" =~ ^[0-9]+$ clause ensures that any non-numeric (indeterminate) version string causes the safety flag to be set, correctly implementing the fail-safe default.

benchmarks/single_node/kimik2.5_fp4_mi355x.sh

Comment on lines +40 to +42

+              if [[ "$version" == "" || $version -lt 177 ]]; then
+                export HSA_NO_SCRATCH_RECLAIM=1
+              fi

Contributor

claude bot Mar 24, 2026

🟡 The MEC firmware version check at line 41 fails silently when rocm-smi returns a non-numeric string (e.g., N/A): the arithmetic comparison $version -lt 177 errors in bash [[ ]] context, causing the overall OR condition to return false and leaving HSA_NO_SCRATCH_RECLAIM unset. Since the comment explains this flag prevents crashes on older/unknown firmware, the safe default for indeterminate versions should be to set the flag — add a regex guard like ! "$version" =~ ^[0-9]+$ as the middle condition.

Extended reasoning...

The firmware version check is intended to protect against crashes on older MEC firmware by setting HSA_NO_SCRATCH_RECLAIM=1. The check reads:

version=`rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}'`
if [[ "$version" == "" || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

How the bug manifests: If rocm-smi --showfw outputs a non-numeric last field for the MEC line — for example N/A, unknown, or a version string with a non-numeric suffix — the empty-string guard passes (version is not empty), but the second operand $version -lt 177 attempts a bash integer arithmetic comparison on a non-integer. Inside [[ ]], this silently returns false (bash evaluates it as an error/false rather than raising a fatal error). Due to short-circuit OR semantics, the overall condition is false, so HSA_NO_SCRATCH_RECLAIM is never exported.

Why existing code doesn't prevent it: The empty string check "$version" == "" only handles the case where awk produces no output. It does not guard against non-integer strings like N/A that awk could reasonably extract as the last field of rocm-smi output. Bash integer comparison $var -lt N inside [[ ]] does NOT raise an error that propagates — it simply evaluates to false, making the entire OR condition false.

Concrete proof: Suppose rocm-smi --showfw outputs a line like ... MEC ... N/A. After awk '{print $NF}', version=N/A. Then:

[[ "N/A" == "" ]] → false (short-circuit does NOT skip second condition)
[[ N/A -lt 177 ]] → bash arithmetic error, condition evaluates to false
Overall [[ false || false ]] → false
HSA_NO_SCRATCH_RECLAIM=1 is NOT set

The machine may then experience the RCCL memory reclaim crashes the flag was meant to prevent.

Impact: When firmware version is indeterminate (non-numeric), the workaround is silently skipped. The safe/fail-safe behavior should be to apply the workaround when the version cannot be confirmed to be ≥ 177.

Fix: Add a numeric guard as the middle condition:

if [[ "$version" == "" || ! "$version" =~ ^[0-9]+$ || $version -lt 177 ]]; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi

This ensures: empty string → set flag, non-numeric string → set flag, numeric < 177 → set flag, numeric ≥ 177 → do not set flag.

seungrokj reviewed

View reviewed changes

benchmarks/single_node/kimik2.5_fp4_mi355x.sh


		export VLLM_ROCM_USE_AITER=1
		export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4

Collaborator

seungrokj Mar 24, 2026

if [ "${TP}" -lt 8 ]; then
export VLLM_ROCM_USE_AITER_RMSNORM=0
fi

@ChuanLi1101 we may need this

Contributor Author

ChuanLi1101 Mar 24, 2026

Done! Added the AITER RMSNorm disable for TP < 8 in commit a2d37a7. Also applied a defensive default for EP_SIZE. Thanks @seungrokj!


          Disable AITER RMSNorm for TP < 8 due to accuracy issues, fix EP_SIZE …

a2d37a7

…default

Made-with: Cursor

auto-merge was automatically disabled

March 24, 2026 10:03

Head branch was pushed to by a user without write access

Collaborator

seungrokj commented Mar 24, 2026

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm

Contributor

github-actions bot commented Mar 24, 2026

@seungrokj Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23483872729
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm
Pinned ref: a2d37a7
Approval: not required (trusted collaborator).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet