Relaxed atomic ordering and AMDGPU-scoped atomics#38
Open
rtmadduri wants to merge 4 commits into
Open
Conversation
Collaborator
Author
|
/run-ci |
1 similar comment
Collaborator
Author
|
/run-ci |
Collaborator
Author
|
/run-ci |
1 similar comment
Collaborator
Author
|
/run-ci |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The PR rewrites every AMDGPU atomic emission so it lowers as:
atomicrmw <op> ptr %dest, <ty> %val syncscope("agent") monotonic, align Ninstead of the C++-default
atomicrmw <op> ptr %dest, <ty> %val seq_cst, align N// implicit system scopeOn
gfx942 (MI300), that change skips thes_waitcnt vmcnt(0) lgkmcnt(0)+buffer_inv / buffer_wbinvl1_volcache-invalidate pair thatSequentiallyConsistent + systematomics emit around everyglobal_atomic_*.File: quadrants/codegen/amdgpu/codegen_amdgpu.cppSummary
llvm::SyncScope::ID amdgpu_atomic_scope()that returns theagentsyncscope ID (device-scope, not system-scope).optimized_reduction,integral_type_atomic,real_type_atomic, andatomic_op_using_casfrom the base LLVM codegen.CreateAtomicRMWandCreateAtomicCmpXchgcall fromAtomicOrdering::SequentiallyConsistent+ default (system) syncscope toAtomicOrdering::Monotonic+agentsyncscope.Rationale
The base LLVM codegen unconditionally emitted the strongest possible atomic primitives: sequentially-consistent ordering (full memory fence semantics) at system scope (visible to host CPU). On AMDGPU this lowers to L2/HBM cache flushes and cross-CU broadcasts on every atomic, even though the required kernels only need device-local visibility (no CPU readers, no cross-stream synchronization in the inner loop).
Two key facts about Quadrants atomics:
The
optimized_reductionpath only handles atomics flaggedis_reduction=trueinAtomicOpStmt. Tracing the IR builder revealed that the dominant atomic-heavy kernels (narrowphase contact, broadphase, inequality constraint kernels) construct their atomics through the general atomic path(integral_type_atomic / real_type_atomic), not the reduction path. Overriding onlyoptimized_reductionwould have missed them entirely.The
atomic_op_using_cashelper is the fallback for atomic ops with no native instruction; on AMDGPU this is hit for f64 atomics and several integer width combinations. It also needed the relaxed ordering.Changes
Test scaffolding
tests/python/test_atomic_amdgpu.py— 14 end-to-end correctness tests covering every (type, op) combination that the AMDGPU atomic path emits: i32/i64/u32/u64 × {add, min, max, and, or, xor}, f32/f64 × {add, min, max}, i32/f32 × mul, plus the cross-kernel index-allocator pattern. The intra-kernel publish/subscribe pattern is pytest.skip-marked with a pointer to the docs so the contract is greppable.tests/python/test_atomic_amdgpu_ir.py— 8 IR-level tripwires using the existing print_kernel_llvm_ir + subprocess pattern (mirroring test_fn_attrs.py). Each test runs one minimal kernel, dumps the LLVM IR, and asserts on the presence of syncscope("agent") monotonic and the absence of any seq_cst atomicrmw/cmpxchg. This catches regressions that would still produce correct results but cost the cache-flush back.Base class refactor
Introduced two virtual hooks on
TaskCodeGenLLVM:codegen_llvm.hLines 247-263
Threaded them through every atomic emission in
TaskCodeGenLLVM::integral_type_atomic,TaskCodeGenLLVM::real_type_atomic, andTaskCodeGenLLVM::atomic_op_using_cas. Deleted the three AMDGPU verbatim copies and replaced them with two-line overrides:codegen_amdgpu.cppLines 179-184
Cached the
agentscope ID once in the AMDGPU constructor (replacing the per-emissiongetOrInsertSyncScopeIDcall). CPU and CUDA codegens are byte-for-byte IR-identical to before.Correctness fixes
Mixed-scope eliminated. Added a third virtual
prefer_cas_for_fp_minmax()defaulting to false; AMDGPU overrides totrue. InTaskCodeGenLLVM::real_type_atomic, before the runtime-helper fallback forf32/f64 min/max, we now route those ops throughatomic_op_using_caswhen the virtual is true. The CAS path usesdefault_atomic_ordering() / default_atomic_scope(), so on AMDGPUqd.atomic_min(f32_field, x)andqd.atomic_add(f32_field, y)both lower withsyncscope("agent")monotonic — no more mixed-scope UB.CAS-loop initial load is now atomic.
TaskCodeGenLLVM::atomic_op_using_casnow usesload->setAtomic(default_atomic_ordering(), default_atomic_scope())with an explicit natural alignment instead of a non-atomic load. The LLVM optimizer is no longer permitted to hoist that load out of the CAS retry block on any current or future LLVM version.