Add infer_device_arch() for coarse GPU architecture detection by yueyiming2009 · Pull Request #1273 · linkedin/Liger-Kernel

yueyiming2009 · 2026-06-27T17:43:33Z

Summary

infer_device() only returns the device type ("cuda" / "xpu" / "npu" / "cpu"), which collapses every NVIDIA and AMD GPU into "cuda" and exposes no way to branch on hardware generation (e.g. Blackwell-specific paths).

This PR adds infer_device_arch() to src/liger_kernel/utils.py, which resolves a coarse architecture/generation name:

- NVIDIA (via CUDA compute-capability major): volta_turing / ampere_ada / hopper / blackwell (else sm_<major>0)
- AMD (via gcnArchName): cdna / cdna2 / cdna3 / rdna3 (else the raw gfx target)
- Intel XPU (via device name): pvc / arc (else xpu)
- Ascend NPU (via device name): ascend910 / ascend310 (else npu)

It falls back to the infer_device() device type when the architecture can't be determined (or on error), is lru_cache-d so detection runs once per process, and distinguishes ROCm — which also reports as "cuda" in torch — from NVIDIA via torch.version.hip.
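For readers who want to see the "cuda" branch of this dispatch in isolation, here is a minimal sketch. It is not the code from this PR: the function name coarse_arch, its parameters, and both table contents are assumptions made for illustration. The hardware values that the real utility would read from torch (torch.version.hip, the compute-capability major, gcnArchName) are passed in as plain arguments so the sketch runs without a GPU. XPU and NPU name matching is left out, so those device types simply fall through to the device-type fallback.

```python
from typing import Optional

# Hypothetical tables for illustration only. The PR's real
# _NVIDIA_ARCH_BY_MAJOR and _AMD_ARCH_BY_GFX may contain different entries.
NVIDIA_ARCH_BY_MAJOR = {
    7: "volta_turing",
    8: "ampere_ada",
    9: "hopper",
    10: "blackwell",
    12: "blackwell",
}
AMD_ARCH_BY_GFX = {
    "gfx908": "cdna",
    "gfx90a": "cdna2",
    "gfx942": "cdna3",
    "gfx1100": "rdna3",
}


def coarse_arch(
    device: str,
    hip_version: Optional[str] = None,
    cc_major: Optional[int] = None,
    gcn_arch_name: Optional[str] = None,
) -> str:
    """Resolve a coarse architecture name, falling back to the device type."""
    if device != "cuda":
        # XPU/NPU name matching is omitted in this sketch.
        return device
    if hip_version is not None:
        # ROCm builds also report "cuda"; a non-None HIP version marks AMD.
        if not gcn_arch_name:
            return device
        # gcnArchName can carry feature flags, e.g. "gfx90a:sramecc+:xnack-".
        gfx = gcn_arch_name.split(":")[0]
        return AMD_ARCH_BY_GFX.get(gfx, gfx)
    if cc_major is None:
        return device
    return NVIDIA_ARCH_BY_MAJOR.get(cc_major, f"sm_{cc_major}0")
```

Under these assumptions, coarse_arch("cuda", cc_major=9) resolves to "hopper", while an unmapped major such as 6 resolves to "sm_60".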

Details

The NVIDIA and AMD paths are reliable. The XPU and NPU paths are best-effort string matching on the device name and were validated via mocking rather than on real Intel/Ascend hardware — the mapping tables (_AMD_ARCH_BY_GFX, _NVIDIA_ARCH_BY_MAJOR, and the XPU/NPU name keywords) are the spots to refine as new parts ship.

Testing Done

Added test/transformers/test_utils.py (30 cases) covering all four device families' mapping logic, NVIDIA-vs-AMD dispatch, the exception/unaccelerated fallbacks, and lru_cache behavior. The hardware-specific torch calls are mocked so the full matrix runs on a CPU-only host.
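The mocking approach described above can be sketched as follows. This is an illustration, not an excerpt from test/transformers/test_utils.py: fake_cuda is a hypothetical stand-in for torch.cuda, and detect_arch is a simplified, hypothetical detector. The point is the pattern: clear the lru_cache between cases, patch the hardware query, and check that the cached function only probes once.

```python
from functools import lru_cache
from types import SimpleNamespace
from unittest import mock


def _no_gpu():
    # Mimics a hardware query failing on a CPU-only host.
    raise RuntimeError("no accelerator on this host")


# Hypothetical stand-in for torch.cuda so the sketch runs without torch.
fake_cuda = SimpleNamespace(get_device_capability=_no_gpu)


@lru_cache(maxsize=None)
def detect_arch() -> str:
    """Simplified detector: cached, and falls back to "cpu" on any error."""
    try:
        major, _minor = fake_cuda.get_device_capability()
    except Exception:
        return "cpu"
    return {9: "hopper"}.get(major, f"sm_{major}0")


def check_hopper_is_detected_once():
    detect_arch.cache_clear()  # start from a cold cache
    with mock.patch.object(
        fake_cuda, "get_device_capability", return_value=(9, 0)
    ) as probe:
        assert detect_arch() == "hopper"
        assert detect_arch() == "hopper"
        # The second call is served from the cache, so only one probe happens.
        assert probe.call_count == 1
    detect_arch.cache_clear()  # do not leak the mocked result to other cases


def check_fallback_on_error():
    detect_arch.cache_clear()
    assert detect_arch() == "cpu"
    detect_arch.cache_clear()
```

Clearing the cache before and after each case matters because a cached result from one mocked device would otherwise be returned in the next case.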

Hardware Type: CPU (Apple Silicon, macOS) — GPU paths exercised via mocks
run make checkstyle to ensure code style (ruff check + ruff format --check pass on changed files)
run make test to ensure correctness (ran the new test/transformers/test_utils.py suite: 30 passed; full make test requires a GPU + deps not available locally)
run make test-convergence to ensure convergence (N/A — no kernel/numerics change)

infer_device() only returns the device type ("cuda"/"xpu"/"npu"/"cpu"), which collapses every NVIDIA/AMD GPU into "cuda". This adds infer_device_arch() to resolve a coarse architecture/generation name so kernels can branch on hardware (e.g. Blackwell-specific paths):

- NVIDIA: volta_turing / ampere_ada / hopper / blackwell (else sm_<major>0)
- AMD: cdna / cdna2 / cdna3 / rdna3 (else raw gfx target)
- Intel: pvc / arc (else xpu)
- Ascend: ascend910 / ascend310 (else npu)

Falls back to the infer_device() device type when undetectable, is lru_cached, and distinguishes ROCm (reports as "cuda") from NVIDIA via torch.version.hip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Split the single "blackwell" label into blackwell / blackwell_ultra / blackwell_consumer so sm_100, sm_103, and sm_120 no longer collapse onto one family name. Fall back to the exact sm_<major><minor> for unknown capabilities instead of zeroing the minor.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
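A rough sketch of what this refinement could look like, assuming the lookup is keyed on the full (major, minor) capability for the Blackwell parts and on the major alone for older families. The function name nvidia_arch and both tables are hypothetical. Only the three Blackwell labels and the sm_<major><minor> fallback come from the commit message above.

```python
# Hypothetical tables. Only the three Blackwell labels and the exact
# sm_<major><minor> fallback are taken from the commit message.
NVIDIA_ARCH_BY_CAPABILITY = {
    (10, 0): "blackwell",
    (10, 3): "blackwell_ultra",
    (12, 0): "blackwell_consumer",
}
# Assumption: older generations are still grouped by major version.
NVIDIA_ARCH_BY_MAJOR = {7: "volta_turing", 8: "ampere_ada", 9: "hopper"}


def nvidia_arch(major: int, minor: int) -> str:
    """Map a CUDA compute capability to a coarse architecture label."""
    exact = NVIDIA_ARCH_BY_CAPABILITY.get((major, minor))
    if exact is not None:
        return exact
    family = NVIDIA_ARCH_BY_MAJOR.get(major)
    if family is not None:
        return family
    # Unknown capability: keep the minor instead of zeroing it.
    return f"sm_{major}{minor}"
```

With this shape, an unlisted capability such as (6, 1) is reported as "sm_61" rather than "sm_60".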

vaibhavjindal

LGTM

vaibhavjindal · 2026-06-30T20:19:19Z

The current version looks good to me. @Tcc0403 do you think it is good to merge now?

Replace the PR-local get_gpu_arch() helper in ops/utils.py with the infer_device_arch() utility merged in linkedin#1273 (src/liger_kernel/utils.py). The Blackwell gate now reads infer_device_arch().startswith("blackwell"), which covers the whole sm_100+ generation (blackwell / blackwell_ultra / blackwell_consumer), matching the original major>=10 intent.

bf16/fp16 on Blackwell -> 8 warps; fp32 and Hopper/earlier -> 32; AMD -> 16. num_warps is a scheduling-only parameter, so there is no correctness impact. Full CE suite (177 tests, bf16+fp32) passes on B200 (sm_100).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
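The gating rule in that commit message can be sketched as a small pure function. This is an assumption-laden illustration, not the code from #1267: the name pick_num_warps, the string-valued dtype_name, and the is_hip flag are all hypothetical, chosen so the example runs without torch. The architecture string is assumed to be whatever infer_device_arch() returned.

```python
def pick_num_warps(arch: str, dtype_name: str, is_hip: bool = False) -> int:
    """Choose a warp count from the coarse architecture and the dtype.

    arch is assumed to be the value returned by infer_device_arch();
    dtype_name and is_hip are hypothetical stand-ins for a torch dtype
    and a ROCm check.
    """
    if is_hip:
        # AMD -> 16
        return 16
    half_precision = dtype_name in ("bfloat16", "float16")
    if arch.startswith("blackwell") and half_precision:
        # bf16/fp16 on any Blackwell variant -> 8
        return 8
    # fp32 everywhere, and Hopper or earlier -> 32
    return 32
```

Using startswith("blackwell") rather than an equality check is what lets one gate cover blackwell, blackwell_ultra, and blackwell_consumer.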

yueyiming2009 mentioned this pull request Jun 27, 2026

perf(swiglu): architecture-aware column tiling for Blackwell (B200) #1271

Merged

3 tasks

Tcc0403 reviewed Jun 27, 2026

Comment thread src/liger_kernel/utils.py Outdated

Comment thread src/liger_kernel/utils.py

Comment thread src/liger_kernel/utils.py

yueyiming2009 and others added 2 commits June 27, 2026 15:29

Merge branch 'main' into yueyiming2009/beginner

9fc8607

vaibhavjindal approved these changes Jun 30, 2026

vaibhavjindal mentioned this pull request Jun 30, 2026

perf(ce): dtype-aware num_warps (Blackwell-gated) #1267

Merged

yueyiming2009 added this pull request to the merge queue Jul 2, 2026

Merged via the queue into linkedin:main with commit 8d03e96 Jul 2, 2026
5 of 7 checks passed

yueyiming2009 deleted the yueyiming2009/beginner branch July 2, 2026 02:28

yueyiming2009 merged 3 commits into linkedin:main from yueyiming2009:yueyiming2009/beginner
