Optimize w8a8 pallas kernel #9473

Merged

lsy323 merged 1 commit into pytorch:master from kyuyeunk:optimize-w8a8-pallas-kernel

Jul 16, 2025

Contributor

kyuyeunk commented Jul 11, 2025 (edited)

This PR improves the performance of the quantized matmul kernel with the following optimizations:

- Minimize run-time branching by generating multiple specialized functions at compile time
- Use bf16 during input quantization
- When possible, cache the input quantization result for later reuse to avoid re-quantization
- Save the accumulation output to scratch memory only when necessary
- Create scratch memory only when necessary
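For readers unfamiliar with w8a8 (int8 weights, int8 activations), the following is a minimal pure-Python sketch of what such a matmul computes and of the input-quantization caching idea from the list above. It is not the Pallas kernel from this PR: the names `quantize_rows` and `w8a8_matmul`, the `block_n` parameter, and the choice of symmetric per-row quantization are all assumptions made for illustration.

```python
# Illustrative sketch only; names and quantization scheme are hypothetical
# and do not come from quantized_matmul_kernel.py.

def quantize_rows(x):
    """Symmetric per-row int8 quantization: returns (int rows, per-row scales)."""
    q_rows, scales = [], []
    for row in x:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_matmul(x, w_q, w_scale, block_n=2):
    """Compute x @ w.T where w is given as int8 values plus per-channel scales.

    x:       [M][K] floats (activations, quantized on the fly)
    w_q:     [N][K] ints in [-127, 127] (pre-quantized weights)
    w_scale: [N] floats (one scale per output channel)
    Returns (output [M][N], number of times the input was quantized).
    """
    n_out = len(w_q)
    out = [[0.0] * n_out for _ in x]
    cached = None          # quantized activations, reused across output blocks
    quantize_calls = 0
    for n0 in range(0, n_out, block_n):
        if cached is None:  # quantize once, on the first output block only
            cached = quantize_rows(x)
            quantize_calls += 1
        x_q, x_scale = cached
        for m, row in enumerate(x_q):
            for n in range(n0, min(n0 + block_n, n_out)):
                # Integer accumulation, then rescale back to floating point.
                acc = sum(a * b for a, b in zip(row, w_q[n]))
                out[m][n] = acc * x_scale[m] * w_scale[n]
    return out, quantize_calls
```

Without the cache, the activations would be re-quantized once per output block; with it, `quantize_calls` stays at 1 regardless of how many blocks the output is split into.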

vanbasten23 self-requested a review

July 11, 2025 17:45

vanbasten23 reviewed

torch_xla/experimental/pallas_kernels/quantized_matmul_kernel.py (resolved)

vanbasten23 reviewed

torch_xla/experimental/pallas_kernels/quantized_matmul_kernel.py (outdated, resolved)

vanbasten23 reviewed

torch_xla/experimental/pallas_kernels/quantized_matmul_kernel.py (resolved)

vanbasten23 reviewed

torch_xla/experimental/pallas_kernels/quantized_matmul_kernel.py (resolved)

vanbasten23 reviewed

torch_xla/experimental/pallas_kernels/quantized_matmul_kernel.py (outdated, resolved)

vanbasten23 reviewed

torch_xla/experimental/pallas_kernels/quantized_matmul_kernel.py (resolved)

vanbasten23 reviewed

torch_xla/experimental/pallas_kernels/quantized_matmul_kernel.py (resolved)

kyuyeunk force-pushed the optimize-w8a8-pallas-kernel branch 7 times, most recently from c246fa6 to 9a965b0

July 14, 2025 21:56

vanbasten23 approved these changes

Collaborator

vanbasten23 left a comment

Looks good. Thanks @kyuyeunk
Will merge once the TPU CI finishes.

Collaborator

vanbasten23 commented Jul 15, 2025

The TPU CI seems to be blocking the merge.

Contributor Author

kyuyeunk commented Jul 15, 2025

> The TPU CI seems to be blocking the merge.

Do you know what is wrong and how to resolve it?


Optimize w8a8 pallas kernel

bdc7787

Adds the following optimizations:
- Minimize run-time branching by generating multiple specialized functions at compile time
- Use bf16 during input quantization
- When possible, cache the input quantization result for later reuse to avoid re-quantization
- Save the accumulation output to scratch memory only when necessary
- Create scratch memory only when necessary
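
The first and last two items can be illustrated with a small pure-Python sketch. It is one possible reading of the commit message, not the code in this commit: `make_matmul_step` and `run` are hypothetical names, and single-element lists stand in for scratch and output buffers. The variant is chosen once when the step function is built, the scratch buffer is allocated only for the multi-step variant, and the final step writes the output directly without storing to scratch.

```python
# Illustrative sketch only; all names here are hypothetical.

def make_matmul_step(num_k_steps):
    """Return (step_fn, needs_scratch) specialized for the number of K steps.

    The choice between variants is made once, here, instead of being
    re-evaluated inside the function that runs on every step.
    """
    if num_k_steps == 1:
        # Single K step: write the result directly; no accumulator needed.
        def step(k, partial, scratch, out):
            out[0] = partial
        needs_scratch = False
    else:
        last = num_k_steps - 1

        def step(k, partial, scratch, out):
            if k == last:
                # Final step: emit the result and skip the scratch write.
                out[0] = scratch[0] + partial
            elif k == 0:
                scratch[0] = partial
            else:
                scratch[0] += partial
        needs_scratch = True
    return step, needs_scratch


def run(partials):
    """Reduce per-step partial results; returns (result, scratch_was_allocated)."""
    step, needs_scratch = make_matmul_step(len(partials))
    scratch = [0] if needs_scratch else None  # allocate only when needed
    out = [None]
    for k, p in enumerate(partials):
        step(k, p, scratch, out)
    return out[0], scratch is not None
```

For example, `run([5])` takes the single-step path and allocates no scratch, while `run([1, 2, 3])` accumulates through scratch and writes the output on the last step.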

kyuyeunk force-pushed the optimize-w8a8-pallas-kernel branch from 9a965b0 to bdc7787

July 15, 2025 17:59

lsy323 merged commit 36ff641 into pytorch:master

23 of 24 checks passed

kyuyeunk deleted the optimize-w8a8-pallas-kernel branch

July 16, 2025 16:44

kyuyeunk mentioned this pull request

Implement performance optimized w8a8 pallas kernel vllm-project/tpu-inference#243

Merged

3 tasks
