Optimize Kimi-K2.5-FP4 on AMD MI355X: Enable AITER and Expert Parallel #922
ChuanLi1101 wants to merge 4 commits into SemiAnalysisAI:main
…pace
- Remove redundant VLLM_ROCM_USE_AITER_MLA=1 and VLLM_ROCM_USE_AITER_MOE=1 (both default to True in vllm envs.py, only master switch needed)
- Remove VLLM_ROCM_USE_AITER_TRITON_ROPE=1 (noop without --compilation-config custom_ops+=+rotary_embedding)
- Switch VLLM_ROCM_QUICK_REDUCE_QUANTIZATION from INT8 to INT4 for better TTFT/TPOT (2.2x vs 1.17x per quickreduce benchmarks)
- Add TP8EP1 back to all search spaces alongside TP4EP1 and TP4EP4 so InferenceX can sweep and determine optimal config empirically

Made-with: Cursor
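The environment changes described in this commit message amount to something like the sketch below. It is based only on the commit message, not on the script's actual contents, and it assumes the variables are exported before the server is launched.

```shell
# Master switch only: per the commit message, the MLA and MOE sub-switches
# default to True in vllm envs.py, so they are not set explicitly here.
export VLLM_ROCM_USE_AITER=1

# Quick-reduce quantization level, switched from INT8 to INT4 in this commit.
export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT4
```

VLLM_ROCM_USE_AITER_TRITON_ROPE is left unset in this sketch because, per the commit message, it has no effect without the matching --compilation-config option.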
/sweep test-config --config-files .github/configs/amd-master.yaml --runner-config .github/configs/runners.yaml --config-keys kimik2.5-fp4-mi355x-vllm

@seungrokj Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23416123849
--max-model-len $MAX_MODEL_LEN \
--block-size=64 \
--disable-log-requests \
--block-size=1 \
--no-enable-prefix-caching \
@ChuanLi1101 now we need this
- { tp: 8, conc-start: 4, conc-end: 64 }

kimik2.5-fp4-mi355x-vllm:
  image: vllm/vllm-openai-rocm:v0.16.0
@ChuanLi1101 vllm/vllm-openai-rocm:v0.18.0 seems to give an extra perf gain.
Superseded by #936, which addresses all review feedback (--no-enable-prefix-caching, image update to v0.18.0) and includes additional optimizations.
Summary
- Enable AITER via environment variables (VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MLA=1, VLLM_ROCM_USE_AITER_MOE=1, etc.)
- Pass --enable-expert-parallel when EP_SIZE > 1, and add a corresponding tp: 4, ep: 4 search-space entry in the benchmark config for ISL=1024/OSL=8192
- Lower --gpu-memory-utilization from 0.95 to 0.90 for stability, change --block-size from 64 to 1
- Set HSA_NO_SCRATCH_RECLAIM=1 on older firmware (< 177) to avoid crashes

Changed Files
- .github/configs/amd-master.yaml: kimik2.5-fp4-mi355x-vllm search spaces: tp 8->4, add ep=4 config
- benchmarks/single_node/kimik2.5_fp4_mi355x.sh

Test Plan
- Run the kimik2.5-fp4-mi355x-vllm benchmark suite on MI355X with the updated config
- Verify the expert-parallel path (EP_SIZE > 1) launches correctly with --enable-expert-parallel
- Verify stability with --gpu-memory-utilization 0.90 and --block-size 1
- Verify the HSA_NO_SCRATCH_RECLAIM guard works
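
The launch-time logic described in the Summary could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual script: the helper names ep_args, needs_scratch_workaround, and build_serve_args are hypothetical, and the firmware version is passed in as a plain integer because this thread does not show how the real script detects it.

```shell
# Hypothetical helpers; names are illustrative and not taken from the PR.

# Emit --enable-expert-parallel only when expert parallelism is requested.
ep_args() {
  if [ "${1:-1}" -gt 1 ]; then
    echo "--enable-expert-parallel"
  fi
}

# Succeed when the given firmware version is older than 177.
# How the version number is obtained is left to the caller (assumption).
needs_scratch_workaround() {
  [ "$1" -lt 177 ]
}

# Assemble the flags named in the Summary into one string for inspection.
build_serve_args() {
  args="--gpu-memory-utilization 0.90 --block-size 1"
  extra="$(ep_args "$1")"
  if [ -n "$extra" ]; then
    args="$args $extra"
  fi
  echo "$args"
}

# Example: firmware 176 is below the threshold, so the workaround is set.
if needs_scratch_workaround 176; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi
```

Building the flag string in a function keeps the EP_SIZE branch checkable without starting a server, which matches the second and fourth Test Plan items.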