Optimize Kimi-K2.5-FP4 on AMD MI355X: Enable AITER and Expert Parallel#922

Closed

ChuanLi1101 wants to merge 4 commits into SemiAnalysisAI:main from ChuanLi1101:dev/rocm

Conversation

@ChuanLi1101
Contributor

Summary

  • Enable AITER acceleration for Kimi-K2.5-FP4 on MI355X, including MLA, MoE, Triton RoPE, and INT8 quick-reduce quantization (VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MLA=1, VLLM_ROCM_USE_AITER_MOE=1, etc.)
  • Add expert parallel support with --enable-expert-parallel when EP_SIZE > 1, and add a corresponding tp: 4, ep: 4 search-space entry in the benchmark config for ISL=1024/OSL=8192
  • Reduce tensor parallelism from 8 to 4 across all benchmark search spaces, allowing better per-GPU utilization
  • Tune serving parameters: lower --gpu-memory-utilization from 0.95 to 0.90 for stability, change --block-size from 64 to 1
  • Add MEC firmware version check to conditionally disable scratch reclaim (HSA_NO_SCRATCH_RECLAIM=1) on older firmware (< 177) to avoid crashes
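The env toggles and the conditional expert-parallel flag described above could be wired up roughly as follows. This is a sketch, not the PR's actual script: the documented `VLLM_ROCM_*` names come from the summary, but `TP_SIZE`, `EP_SIZE`, and `EP_FLAG` are illustrative variable names.

```shell
#!/bin/sh
# Master AITER switch plus the MLA/MoE sub-toggles named in the summary
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_MOE=1

TP_SIZE=${TP_SIZE:-4}   # tensor-parallel degree (reduced from 8 to 4 in this PR)
EP_SIZE=${EP_SIZE:-4}   # expert-parallel degree; 1 means no expert parallelism

# Only pass --enable-expert-parallel when EP is actually requested
EP_FLAG=""
if [ "$EP_SIZE" -gt 1 ]; then
    EP_FLAG="--enable-expert-parallel"
fi

echo "would run: vllm serve ... --tensor-parallel-size $TP_SIZE $EP_FLAG"
```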

Changed Files

  • .github/configs/amd-master.yaml: update kimik2.5-fp4-mi355x-vllm search spaces (tp 8 -> 4, add ep=4 config)
  • benchmarks/single_node/kimik2.5_fp4_mi355x.sh: enable AITER env vars, add expert parallel flag, MEC FW check, tune vLLM args

Test Plan

  • Run kimik2.5-fp4-mi355x-vllm benchmark suite on MI355X with the updated config
  • Verify expert parallel mode (EP_SIZE > 1) launches correctly with --enable-expert-parallel
  • Validate AITER MLA/MoE kernels are active via vLLM logs
  • Confirm no OOM or crash with --gpu-memory-utilization 0.90 and --block-size 1
  • Test on systems with MEC FW < 177 to verify the HSA_NO_SCRATCH_RECLAIM guard works
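The firmware guard exercised by the last item could look like this sketch. The `get_mec_fw_version` helper and the sysfs path are assumptions; the PR's script may obtain the MEC firmware version differently.

```shell
#!/bin/sh
# Hypothetical helper: amdgpu exposes firmware versions under sysfs on many
# systems, but the actual script may query the MEC FW version another way.
get_mec_fw_version() {
    cat /sys/class/drm/card0/device/fw_version/mec_fw_version 2>/dev/null || echo 0
}

RAW=$(get_mec_fw_version)
MEC_FW=$((RAW))   # normalize hex (0x..) or decimal sysfs output to decimal

# Scratch reclaim crashes on MEC firmware older than 177, so disable it there
if [ "$MEC_FW" -lt 177 ]; then
    export HSA_NO_SCRATCH_RECLAIM=1
    echo "MEC FW $MEC_FW < 177: set HSA_NO_SCRATCH_RECLAIM=1"
fi
```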

@chunfangamd (Collaborator) left a comment

lgtm

…pace

- Remove redundant VLLM_ROCM_USE_AITER_MLA=1 and VLLM_ROCM_USE_AITER_MOE=1
  (both default to True in vllm envs.py, only master switch needed)
- Remove VLLM_ROCM_USE_AITER_TRITON_ROPE=1 (noop without
  --compilation-config custom_ops+=+rotary_embedding)
- Switch VLLM_ROCM_QUICK_REDUCE_QUANTIZATION from INT8 to INT4
  for better TTFT/TPOT (2.2x vs 1.17x per quickreduce benchmarks)
- Add TP8EP1 back to all search spaces alongside TP4EP1 and TP4EP4
  so InferenceX can sweep and determine optimal config empirically

Made-with: Cursor
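The resulting sweep entries might look roughly like this in amd-master.yaml. The `- { tp: ..., conc-start: ..., conc-end: ... }` field style follows the snippets quoted elsewhere in this thread; the concurrency bounds and exact nesting are assumptions, not the merged values.

```yaml
# Illustrative sketch only, not the committed config.
kimik2.5-fp4-mi355x-vllm:
  image: vllm/vllm-openai-rocm:v0.16.0
  search-space:
    - { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }   # TP8EP1 (previous default)
    - { tp: 4, ep: 1, conc-start: 4, conc-end: 64 }   # TP4EP1
    - { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }   # TP4EP4 (expert parallel)
```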
@seungrokj
Collaborator

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm

@github-actions
Contributor

@seungrokj Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23416123849
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm
Pinned ref: 5410ce5
Approval: not required (trusted collaborator).

Review thread on benchmarks/single_node/kimik2.5_fp4_mi355x.sh:

      --max-model-len $MAX_MODEL_LEN \
-     --block-size=64 \
      --disable-log-requests \
+     --block-size=1 \

Collaborator:

@ChuanLi1101 now we need this:

    --no-enable-prefix-caching \

Review thread on .github/configs/amd-master.yaml:

    - { tp: 8, conc-start: 4, conc-end: 64 }

    kimik2.5-fp4-mi355x-vllm:
      image: vllm/vllm-openai-rocm:v0.16.0

Collaborator:

@ChuanLi1101
vllm/vllm-openai-rocm:v0.18.0 seems to give an extra perf gain.

@ChuanLi1101
Contributor Author

Superseded by #936, which addresses all review feedback (--no-enable-prefix-caching, the image update to v0.18.0) and includes additional optimizations.

6 participants