Add infer_device_arch() for coarse GPU architecture detection #1273
Merged
yueyiming2009 merged 3 commits on Jul 2, 2026
infer_device() only returns the device type ("cuda"/"xpu"/"npu"/"cpu"),
which collapses every NVIDIA/AMD GPU into "cuda". This adds
infer_device_arch() to resolve a coarse architecture/generation name so
kernels can branch on hardware (e.g. Blackwell-specific paths):
- NVIDIA: volta_turing / ampere_ada / hopper / blackwell (else sm_<major>0)
- AMD: cdna / cdna2 / cdna3 / rdna3 (else raw gfx target)
- Intel: pvc / arc (else xpu)
- Ascend: ascend910 / ascend310 (else npu)
Falls back to the infer_device() device type when undetectable, is
lru_cached, and distinguishes ROCm (reports as "cuda") from NVIDIA via
torch.version.hip.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
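A minimal sketch of the ROCm-vs-NVIDIA dispatch and the AMD mapping described in the commit message above. The names `cuda_vendor`, `amd_arch_name`, and `_ARCH_BY_GFX` are hypothetical, and the table entries are assumptions based on commonly documented gfx targets, not the PR's actual table. The inputs stand in for `torch.version.hip` and the device's `gcnArchName`, so the sketch runs without torch or a GPU.

```python
# Hypothetical table: the PR's real mapping may contain different entries.
_ARCH_BY_GFX = {
    "gfx908": "cdna",
    "gfx90a": "cdna2",
    "gfx942": "cdna3",
    "gfx1100": "rdna3",
}


def cuda_vendor(hip_version):
    """Tell ROCm from NVIDIA when torch reports the device type as "cuda".

    `hip_version` stands in for torch.version.hip, which is None on
    CUDA builds and a version string on ROCm builds.
    """
    return "amd" if hip_version else "nvidia"


def amd_arch_name(gcn_arch_name):
    """Map a gcnArchName string to a coarse family name.

    The string may carry feature suffixes ("gfx90a:sramecc+:xnack-"),
    so only the part before the first colon is looked up. Unknown
    targets fall back to the raw gfx target.
    """
    target = gcn_arch_name.split(":", 1)[0].strip().lower()
    return _ARCH_BY_GFX.get(target, target)
```

Keeping the lookup on plain strings means the mapping can be exercised on a CPU-only host; only the caller needs to touch torch.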
Tcc0403 reviewed Jun 27, 2026
Split the single "blackwell" label into blackwell / blackwell_ultra / blackwell_consumer so sm_100, sm_103, and sm_120 no longer collapse onto one family name. Fall back to the exact sm_<major><minor> for unknown capabilities instead of zeroing the minor.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
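The split described in this commit could look roughly like the sketch below. The function name `nvidia_arch_name` and the two-table layout are assumptions for illustration; only the label names and the sm_100 / sm_103 / sm_120 pairing come from the commit message.

```python
# Exact (major, minor) pairs are checked first so the Blackwell
# variants stay distinct instead of collapsing onto one label.
_ARCH_BY_CAPABILITY = {
    (10, 0): "blackwell",
    (10, 3): "blackwell_ultra",
    (12, 0): "blackwell_consumer",
}

# Older generations are keyed on the major version alone.
_ARCH_BY_MAJOR = {
    7: "volta_turing",
    8: "ampere_ada",
    9: "hopper",
}


def nvidia_arch_name(major, minor):
    """Map a CUDA compute capability to a coarse architecture label."""
    exact = _ARCH_BY_CAPABILITY.get((major, minor))
    if exact is not None:
        return exact
    family = _ARCH_BY_MAJOR.get(major)
    if family is not None:
        return family
    # Unknown capability: keep the exact minor rather than zeroing it.
    return f"sm_{major}{minor}"
```

In real use the pair would come from something like `torch.cuda.get_device_capability()`; passing it in as arguments keeps the mapping testable without a GPU.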
Collaborator
The current version looks good to me. @Tcc0403 do you think it is good to merge now?
justinhh4 added a commit to justinhh4/Liger-Kernel that referenced this pull request Jul 2, 2026
Replace the PR-local get_gpu_arch() helper in ops/utils.py with the infer_device_arch() utility merged in linkedin#1273 (src/liger_kernel/utils.py).
The Blackwell gate now reads infer_device_arch().startswith("blackwell"), which covers the whole sm_100+ generation (blackwell / blackwell_ultra / blackwell_consumer), matching the original major>=10 intent.
bf16/fp16 on Blackwell -> 8 warps; fp32 and Hopper/earlier -> 32; AMD -> 16. num_warps is a scheduling-only parameter, so there is no correctness impact.
Full CE suite (177 tests, bf16+fp32) passes on B200 (sm_100).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
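The warp selection described in this commit can be sketched as follows. `pick_num_warps`, its string dtype names, and the `is_hip` flag are hypothetical; the referenced commit may structure the gate differently. Only the 8 / 32 / 16 values and the startswith("blackwell") check come from the commit message.

```python
def pick_num_warps(arch, dtype, is_hip=False):
    """Choose a num_warps value from the coarse architecture label.

    `arch` is a label such as the one infer_device_arch() returns,
    `dtype` is a short name like "bf16", "fp16", or "fp32", and
    `is_hip` marks an AMD/ROCm device (assumed to be known by the caller).
    """
    if is_hip:
        return 16
    # The prefix check covers blackwell, blackwell_ultra, and blackwell_consumer.
    if arch.startswith("blackwell") and dtype in ("bf16", "fp16"):
        return 8
    # fp32 on any NVIDIA part, and every dtype on Hopper or earlier.
    return 32
```

Because num_warps only affects scheduling, a wrong branch here would cost performance, not correctness.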
Summary
infer_device() only returns the device type ("cuda"/"xpu"/"npu"/"cpu"), which collapses every NVIDIA and AMD GPU into "cuda" and exposes no way to branch on hardware generation (e.g. Blackwell-specific paths).

This PR adds infer_device_arch() to src/liger_kernel/utils.py, which resolves a coarse architecture/generation name:
- NVIDIA: volta_turing / ampere_ada / hopper / blackwell (else sm_<major>0)
- AMD (gcnArchName): cdna / cdna2 / cdna3 / rdna3 (else the raw gfx target)
- Intel: pvc / arc (else xpu)
- Ascend: ascend910 / ascend310 (else npu)

It falls back to the infer_device() device type when the architecture can't be determined (or on error), is lru_cache-d so detection runs once per process, and distinguishes ROCm, which also reports as "cuda" in torch, from NVIDIA via torch.version.hip.

Details
The NVIDIA and AMD paths are reliable. The XPU and NPU paths are best-effort string matching on the device name and were validated via mocking rather than on real Intel/Ascend hardware; the mapping tables (_AMD_ARCH_BY_GFX, _NVIDIA_ARCH_BY_MAJOR, and the XPU/NPU name keywords) are the spots to refine as new parts ship.

Testing Done
Added test/transformers/test_utils.py (30 cases) covering all four device families' mapping logic, NVIDIA-vs-AMD dispatch, the exception/unaccelerated fallbacks, and lru_cache behavior. The hardware-specific torch calls are mocked so the full matrix runs on a CPU-only host.
- make checkstyle to ensure code style (ruff check + ruff format --check pass on changed files)
- make test to ensure correctness (ran the new test/transformers/test_utils.py suite: 30 passed; full make test requires a GPU + deps not available locally)
- make test-convergence to ensure convergence (N/A: no kernel/numerics change)
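For readers unfamiliar with the testing approach, here is a rough sketch of how a cached detector with an error fallback can be exercised on a CPU-only host. It is not the PR's test code: `_read_capability`, `infer_arch_sketch`, and `check_with_fake_capability` are hypothetical stand-ins, and "cpu" stands in for whatever infer_device() would return.

```python
from functools import lru_cache
from unittest import mock


def _read_capability():
    # Hypothetical stand-in for a hardware query; on a host without
    # an accelerator the real call would fail in a similar way.
    raise RuntimeError("no accelerator on this host")


@lru_cache(maxsize=1)
def infer_arch_sketch():
    """Detect once per process, falling back to the device type on error."""
    try:
        major, minor = _read_capability()
    except Exception:
        return "cpu"  # stand-in for the infer_device() fallback
    return f"sm_{major}{minor}"


def check_with_fake_capability(capability):
    """Run the detector twice with the hardware query replaced by a mock."""
    fake = mock.Mock(return_value=capability)
    infer_arch_sketch.cache_clear()  # a cached result would hide the patch
    with mock.patch.dict(globals(), {"_read_capability": fake}):
        first = infer_arch_sketch()
        second = infer_arch_sketch()  # served from the cache, no new query
    infer_arch_sketch.cache_clear()  # do not leak the fake result
    return first, second, fake.call_count
```

Clearing the cache before and after each mocked run matters: without it, the first result computed in the process would be returned for every later case regardless of the patch.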