Optimize Kimi-K2.5-FP4 on AMD MI355X: Enable AITER and Expert Parallel#922

Closed

ChuanLi1101 wants to merge 4 commits into SemiAnalysisAI:main from ChuanLi1101:dev/rocm

Conversation

@ChuanLi1101
Contributor

Summary

  • Enable AITER acceleration for Kimi-K2.5-FP4 on MI355X, including MLA, MoE, Triton RoPE, and INT8 quick-reduce quantization (VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MLA=1, VLLM_ROCM_USE_AITER_MOE=1, etc.)
  • Add expert parallel support with --enable-expert-parallel when EP_SIZE > 1, and add a corresponding tp: 4, ep: 4 search-space entry in the benchmark config for ISL=1024/OSL=8192
  • Reduce tensor parallelism from 8 to 4 across all benchmark search spaces, allowing better per-GPU utilization
  • Tune serving parameters: lower --gpu-memory-utilization from 0.95 to 0.90 for stability, change --block-size from 64 to 1
  • Add MEC firmware version check to conditionally disable scratch reclaim (HSA_NO_SCRATCH_RECLAIM=1) on older firmware (< 177) to avoid crashes
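The env toggles and the conditional expert-parallel flag described above could be wired up roughly as follows. This is a sketch, not the PR's actual script: the documented `VLLM_ROCM_*` names come from the summary, but `TP_SIZE`, `EP_SIZE`, and `EP_FLAG` are illustrative variable names.

```shell
#!/bin/sh
# Master AITER switch plus the MLA/MoE sub-toggles named in the summary
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=1
export VLLM_ROCM_USE_AITER_MOE=1

TP_SIZE=${TP_SIZE:-4}   # tensor-parallel degree (reduced from 8 to 4 in this PR)
EP_SIZE=${EP_SIZE:-4}   # expert-parallel degree; 1 means no expert parallelism

# Only pass --enable-expert-parallel when EP is actually requested
EP_FLAG=""
if [ "$EP_SIZE" -gt 1 ]; then
    EP_FLAG="--enable-expert-parallel"
fi

echo "would run: vllm serve ... --tensor-parallel-size $TP_SIZE $EP_FLAG"
```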

Changed Files

  • .github/configs/amd-master.yaml: update kimik2.5-fp4-mi355x-vllm search spaces (tp 8 -> 4, add ep=4 config)
  • benchmarks/single_node/kimik2.5_fp4_mi355x.sh: enable AITER env vars, add expert parallel flag, MEC FW check, tune vLLM args

Test Plan

  • Run kimik2.5-fp4-mi355x-vllm benchmark suite on MI355X with the updated config
  • Verify expert parallel mode (EP_SIZE > 1) launches correctly with --enable-expert-parallel
  • Validate AITER MLA/MoE kernels are active via vLLM logs
  • Confirm no OOM or crash with --gpu-memory-utilization 0.90 and --block-size 1
  • Test on systems with MEC FW < 177 to verify the HSA_NO_SCRATCH_RECLAIM guard works
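The firmware guard exercised by the last item could look like this sketch. The `get_mec_fw_version` helper and the sysfs path are assumptions; the PR's script may obtain the MEC firmware version differently.

```shell
#!/bin/sh
# Hypothetical helper: amdgpu exposes firmware versions under sysfs on many
# systems, but the actual script may query the MEC FW version another way.
get_mec_fw_version() {
    cat /sys/class/drm/card0/device/fw_version/mec_fw_version 2>/dev/null || echo 0
}

RAW=$(get_mec_fw_version)
MEC_FW=$((RAW))   # normalize hex (0x..) or decimal sysfs output to decimal

# Scratch reclaim crashes on MEC firmware older than 177, so disable it there
if [ "$MEC_FW" -lt 177 ]; then
    export HSA_NO_SCRATCH_RECLAIM=1
    echo "MEC FW $MEC_FW < 177: set HSA_NO_SCRATCH_RECLAIM=1"
fi
```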

@chunfangamd (Collaborator) left a comment

lgtm

…pace

- Remove redundant VLLM_ROCM_USE_AITER_MLA=1 and VLLM_ROCM_USE_AITER_MOE=1
  (both default to True in vllm envs.py, only master switch needed)
- Remove VLLM_ROCM_USE_AITER_TRITON_ROPE=1 (noop without
  --compilation-config custom_ops+=+rotary_embedding)
- Switch VLLM_ROCM_QUICK_REDUCE_QUANTIZATION from INT8 to INT4
  for better TTFT/TPOT (2.2x vs 1.17x per quickreduce benchmarks)
- Add TP8EP1 back to all search spaces alongside TP4EP1 and TP4EP4
  so InferenceX can sweep and determine optimal config empirically

Made-with: Cursor
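The resulting sweep entries might look roughly like this in amd-master.yaml. The `- { tp: ..., conc-start: ..., conc-end: ... }` field style follows the snippets quoted elsewhere in this thread; the concurrency bounds and exact nesting are assumptions, not the merged values.

```yaml
# Illustrative sketch only, not the committed config.
kimik2.5-fp4-mi355x-vllm:
  image: vllm/vllm-openai-rocm:v0.16.0
  search-space:
    - { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }   # TP8EP1 (previous default)
    - { tp: 4, ep: 1, conc-start: 4, conc-end: 64 }   # TP4EP1
    - { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }   # TP4EP4 (expert parallel)
```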
@seungrokj
Collaborator

/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm

@github-actions
Contributor

@seungrokj Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23416123849
Command: test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm
Pinned ref: 5410ce5
Approval: not required (trusted collaborator).

Review thread on benchmarks/single_node/kimik2.5_fp4_mi355x.sh:

      --max-model-len $MAX_MODEL_LEN \
-     --block-size=64 \
      --disable-log-requests \
+     --block-size=1 \

Collaborator:

@ChuanLi1101 now we need this:

    --no-enable-prefix-caching \

Review thread on .github/configs/amd-master.yaml:

    - { tp: 8, conc-start: 4, conc-end: 64 }

    kimik2.5-fp4-mi355x-vllm:
      image: vllm/vllm-openai-rocm:v0.16.0

Collaborator:

@ChuanLi1101
vllm/vllm-openai-rocm:v0.18.0 seems to give an extra perf gain.

@ChuanLi1101
Contributor Author

Superseded by #936, which addresses all review feedback (--no-enable-prefix-caching, the image update to v0.18.0) and includes additional optimizations.

6 participants