Add infer_device_arch() for coarse GPU architecture detection #1273

Merged
yueyiming2009 merged 3 commits into linkedin:main from yueyiming2009:yueyiming2009/beginner
Jul 2, 2026
Conversation

@yueyiming2009 yueyiming2009 commented Jun 27, 2026

Collaborator

Summary

infer_device() only returns the device type ("cuda" / "xpu" / "npu" / "cpu"), which collapses every NVIDIA and AMD GPU into "cuda" and exposes no way to branch on hardware generation (e.g. Blackwell-specific paths).

This PR adds infer_device_arch() to src/liger_kernel/utils.py, which resolves a coarse architecture/generation name:

  • NVIDIA (via CUDA compute-capability major): volta_turing / ampere_ada / hopper / blackwell (else sm_<major>0)
  • AMD (via gcnArchName): cdna / cdna2 / cdna3 / rdna3 (else the raw gfx target)
  • Intel XPU (via device name): pvc / arc (else xpu)
  • Ascend NPU (via device name): ascend910 / ascend310 (else npu)

It falls back to the infer_device() device type when the architecture can't be determined or detection raises an error. The result is lru_cache-d, so detection runs once per process. Because ROCm builds also report as "cuda" in torch, the function distinguishes them from NVIDIA via torch.version.hip.
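
A minimal sketch of the NVIDIA/AMD resolution described in the Summary, written as pure functions so it runs without a GPU. The table contents and the helper names (`nvidia_arch`, `amd_arch`, `resolve_cuda_arch`) are assumptions made for illustration, not the code merged in this PR. The real function reads these inputs from torch (`torch.version.hip`, the compute capability, and `gcnArchName`).

```python
from typing import Optional

# Assumed table contents, inferred from the family names listed in the Summary.
NVIDIA_ARCH_BY_MAJOR = {7: "volta_turing", 8: "ampere_ada", 9: "hopper", 10: "blackwell"}
AMD_ARCH_BY_GFX = {"gfx908": "cdna", "gfx90a": "cdna2", "gfx942": "cdna3", "gfx1100": "rdna3"}


def nvidia_arch(major: int) -> str:
    """Map a CUDA compute-capability major to a family name, else sm_<major>0."""
    return NVIDIA_ARCH_BY_MAJOR.get(major, f"sm_{major}0")


def amd_arch(gcn_arch_name: str) -> str:
    """Map a gcnArchName such as 'gfx90a:sramecc+:xnack-' to a family name.

    Feature flags after the first ':' are dropped; unknown targets are
    returned as the raw gfx target.
    """
    gfx = gcn_arch_name.split(":", 1)[0]
    return AMD_ARCH_BY_GFX.get(gfx, gfx)


def resolve_cuda_arch(hip_version: Optional[str], major: int, gcn_arch_name: str = "") -> str:
    """Dispatch between AMD and NVIDIA for a device that torch reports as 'cuda'.

    hip_version stands in for torch.version.hip, which is None on non-ROCm builds.
    """
    if hip_version is not None:
        return amd_arch(gcn_arch_name)
    return nvidia_arch(major)
```

For example, `resolve_cuda_arch(None, 9)` resolves to `"hopper"`, while passing a non-None HIP version routes the same call through the gfx table instead.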

Details

The NVIDIA and AMD paths are reliable. The XPU and NPU paths are best-effort string matching on the device name and were validated via mocking rather than on real Intel/Ascend hardware — the mapping tables (_AMD_ARCH_BY_GFX, _NVIDIA_ARCH_BY_MAJOR, and the XPU/NPU name keywords) are the spots to refine as new parts ship.
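
The best-effort name matching for XPU and NPU could look roughly like the sketch below. The keyword lists and the `match_by_keyword` helper are hypothetical; the PR does not show the actual keywords, and real device-name strings may differ from the samples used here.

```python
from typing import Sequence, Tuple

# Hypothetical (keyword, architecture) pairs; checked in order, first match wins.
XPU_KEYWORDS: Sequence[Tuple[str, str]] = (("data center gpu max", "pvc"), ("arc", "arc"))
NPU_KEYWORDS: Sequence[Tuple[str, str]] = (("910", "ascend910"), ("310", "ascend310"))


def match_by_keyword(device_name: str, keywords: Sequence[Tuple[str, str]], default: str) -> str:
    """Return the architecture for the first keyword found in the device name.

    Matching is case-insensitive substring search; if nothing matches, the
    plain device type (e.g. 'xpu' or 'npu') is returned.
    """
    lowered = device_name.lower()
    for keyword, arch in keywords:
        if keyword in lowered:
            return arch
    return default
```

Substring matching is fragile by nature (a short keyword can match an unrelated name), which is consistent with the PR's note that these tables are the places to refine.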

Testing Done

Added test/transformers/test_utils.py (30 cases) covering all four device families' mapping logic, NVIDIA-vs-AMD dispatch, the exception/unaccelerated fallbacks, and lru_cache behavior. The hardware-specific torch calls are mocked so the full matrix runs on a CPU-only host.

  • Hardware Type: CPU (Apple Silicon, macOS) — GPU paths exercised via mocks
  • run make checkstyle to ensure code style (ruff check + ruff format --check pass on changed files)
  • run make test to ensure correctness (ran the new test/transformers/test_utils.py suite: 30 passed; full make test requires a GPU + deps not available locally)
  • run make test-convergence to ensure convergence (N/A — no kernel/numerics change)
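
As an illustration of the mocking approach, the sketch below tests a cached detector on a CPU-only host. It is not the PR's test code: `fake_torch` is a stand-in namespace used instead of the real torch module, and `detect_arch` is a hypothetical detector with a shortened mapping.

```python
from functools import lru_cache
from types import SimpleNamespace
from unittest import mock

# Stand-in for the torch module so the example runs without torch installed.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(get_device_capability=lambda: (9, 0))
)


@lru_cache(maxsize=None)
def detect_arch() -> str:
    """Hypothetical cached detector: queries the capability once per process."""
    major, _minor = fake_torch.cuda.get_device_capability()
    return {9: "hopper", 10: "blackwell"}.get(major, f"sm_{major}0")


def check_cache_behavior():
    """Patch the hardware query, call the detector twice, and count real queries."""
    detect_arch.cache_clear()  # start from a clean cache
    with mock.patch.object(
        fake_torch.cuda, "get_device_capability", return_value=(10, 0)
    ) as probe:
        first = detect_arch()
        second = detect_arch()
        calls = probe.call_count
    detect_arch.cache_clear()  # do not leak the mocked result into other tests
    return first, second, calls
```

Clearing the cache before and after each case matters here: without it, a result cached under one mock would be returned to the next test regardless of what that test patches.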

infer_device() only returns the device type ("cuda"/"xpu"/"npu"/"cpu"),
which collapses every NVIDIA/AMD GPU into "cuda". This adds
infer_device_arch() to resolve a coarse architecture/generation name so
kernels can branch on hardware (e.g. Blackwell-specific paths):

  - NVIDIA: volta_turing / ampere_ada / hopper / blackwell (else sm_<major>0)
  - AMD:    cdna / cdna2 / cdna3 / rdna3 (else raw gfx target)
  - Intel:  pvc / arc (else xpu)
  - Ascend: ascend910 / ascend310 (else npu)

Falls back to the infer_device() device type when undetectable, is
lru_cached, and distinguishes ROCm (reports as "cuda") from NVIDIA via
torch.version.hip.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
yueyiming2009 and others added 2 commits June 27, 2026 15:29
Split the single "blackwell" label into blackwell / blackwell_ultra /
blackwell_consumer so sm_100, sm_103, and sm_120 no longer collapse onto
one family name. Fall back to the exact sm_<major><minor> for unknown
capabilities instead of zeroing the minor.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
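
The split described in this commit can be sketched as a lookup on the full (major, minor) capability first, then on the major, then a fallback to the exact `sm_<major><minor>` string. The table layout and lookup order are assumptions; only the three Blackwell labels and the fallback format come from the commit message.

```python
# Exact capabilities that get their own label (from the commit message).
BLACKWELL_BY_CAPABILITY = {
    (10, 0): "blackwell",
    (10, 3): "blackwell_ultra",
    (12, 0): "blackwell_consumer",
}

# Assumed per-major families for pre-Blackwell parts.
FAMILY_BY_MAJOR = {7: "volta_turing", 8: "ampere_ada", 9: "hopper"}


def nvidia_arch(major: int, minor: int) -> str:
    """Resolve a family name, keeping the minor version for unknown capabilities."""
    capability = (major, minor)
    if capability in BLACKWELL_BY_CAPABILITY:
        return BLACKWELL_BY_CAPABILITY[capability]
    if major in FAMILY_BY_MAJOR:
        return FAMILY_BY_MAJOR[major]
    return f"sm_{major}{minor}"  # exact capability rather than a zeroed minor
```

In this sketch a capability not listed in either table, such as (6, 1), resolves to `"sm_61"`; how the merged code treats unlisted Blackwell-era minors is not stated in the PR.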

@vaibhavjindal vaibhavjindal left a comment

Collaborator

LGTM

@vaibhavjindal

Collaborator

The current version looks good to me. @Tcc0403 do you think it is good to merge now?

@yueyiming2009 yueyiming2009 added this pull request to the merge queue Jul 2, 2026
Merged via the queue into linkedin:main with commit 8d03e96 Jul 2, 2026
5 of 7 checks passed
@yueyiming2009 yueyiming2009 deleted the yueyiming2009/beginner branch July 2, 2026 02:28
justinhh4 added a commit to justinhh4/Liger-Kernel that referenced this pull request Jul 2, 2026
Replace the PR-local get_gpu_arch() helper in ops/utils.py with the
infer_device_arch() utility merged in linkedin#1273 (src/liger_kernel/utils.py).

The Blackwell gate now reads infer_device_arch().startswith("blackwell"),
which covers the whole sm_100+ generation (blackwell / blackwell_ultra /
blackwell_consumer) — matching the original major>=10 intent. bf16/fp16 on
Blackwell -> 8 warps; fp32 and Hopper/earlier -> 32; AMD -> 16. num_warps is
a scheduling-only parameter, so there is no correctness impact.

Full CE suite (177 tests, bf16+fp32) passes on B200 (sm_100).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
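
The warp-count gate from this commit message can be sketched as below. The function name `pick_num_warps`, the dtype strings, and the `is_amd` flag are hypothetical; only the `startswith("blackwell")` check and the 8 / 32 / 16 values come from the commit message.

```python
def pick_num_warps(arch: str, dtype: str, is_amd: bool = False) -> int:
    """Choose a Triton num_warps value from the coarse architecture name.

    arch is the string infer_device_arch() would return; dtype is a plain
    name such as 'bfloat16', 'float16', or 'float32'.
    """
    if is_amd:
        return 16
    # The prefix check covers blackwell, blackwell_ultra, and blackwell_consumer.
    if arch.startswith("blackwell") and dtype in ("bfloat16", "float16"):
        return 8
    # fp32 on Blackwell, and every dtype on Hopper or earlier.
    return 32
```

Matching on the prefix rather than an exact name keeps the gate working if further Blackwell variants are added to the mapping later.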

4 participants