Relaxed atomic ordering and AMDGPU-scoped atomics #38

Open: rtmadduri wants to merge 4 commits into amd-integration from perf/rtmadduri/atomic-optim

rtmadduri commented May 12, 2026


Description

The PR rewrites every AMDGPU atomic emission so it lowers as:
atomicrmw <op> ptr %dest, <ty> %val syncscope("agent") monotonic, align N

instead of the C++-default

atomicrmw <op> ptr %dest, <ty> %val seq_cst, align N        ; implicit system scope

On gfx942 (MI300), this avoids the s_waitcnt vmcnt(0) lgkmcnt(0) plus buffer_inv / buffer_wbinvl1_vol cache-invalidate pair that sequentially consistent, system-scope atomics emit around every global_atomic_* instruction.

File: quadrants/codegen/amdgpu/codegen_amdgpu.cpp

Summary

  • Added private helper llvm::SyncScope::ID amdgpu_atomic_scope() that returns the agent syncscope ID (device-scope, not system-scope).
  • Overrode optimized_reduction, integral_type_atomic, real_type_atomic, and atomic_op_using_cas from the base LLVM codegen.
  • Changed every CreateAtomicRMW and CreateAtomicCmpXchg call from AtomicOrdering::SequentiallyConsistent + default (system) syncscope to AtomicOrdering::Monotonic + agent syncscope.

Rationale

The base LLVM codegen unconditionally emitted the strongest possible atomic primitives: sequentially consistent ordering (full memory fence semantics) at system scope (visible to the host CPU). On AMDGPU this lowers to L2/HBM cache flushes and cross-CU broadcasts on every atomic, even though the affected kernels only need device-local visibility (no CPU readers, no cross-stream synchronization in the inner loop).
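
As a host-side analogy only, the sketch below uses C++ `std::atomic` to illustrate why relaxed ordering is enough when the result is read only after all writers have finished. The helper name `relaxed_sum` is hypothetical and not part of the PR. Standard C++ has no equivalent of the AMDGPU `agent` syncscope, so only the ordering half of the change is represented here.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not from the PR): several threads bump one counter
// with relaxed read-modify-write operations; the caller reads the total
// only after every thread has been joined.
inline std::int64_t relaxed_sum(int num_threads, int adds_per_thread) {
  assert(num_threads >= 0 && adds_per_thread >= 0);
  std::atomic<std::int64_t> counter{0};
  std::vector<std::thread> workers;
  workers.reserve(static_cast<std::size_t>(num_threads));
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&counter, adds_per_thread] {
      for (int i = 0; i < adds_per_thread; ++i) {
        // Relaxed: the add itself is atomic, but it orders no other memory.
        counter.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  // join() synchronizes with each thread's completion. It plays the role of
  // the kernel boundary here: every increment is visible after this point.
  for (auto &w : workers) w.join();
  return counter.load(std::memory_order_relaxed);
}
```

Calling `relaxed_sum(4, 10000)` returns 40000: no increment is lost, because atomicity does not depend on the ordering argument. What relaxed ordering gives up is any guarantee about when other, non-atomic writes become visible, which matters only for patterns that communicate inside the loop.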

Two key facts about Quadrants atomics:

  • The optimized_reduction path only handles atomics flagged is_reduction=true in AtomicOpStmt. Tracing the IR builder revealed that the dominant atomic-heavy kernels (narrowphase contact, broadphase, inequality constraint kernels) construct their atomics through the general atomic path (integral_type_atomic / real_type_atomic), not the reduction path. Overriding only optimized_reduction would have missed them entirely.

  • The atomic_op_using_cas helper is the fallback for atomic ops with no native instruction; on AMDGPU this is hit for f64 atomics and several integer width combinations. It also needed the relaxed ordering.

Changes

Test scaffolding

  • tests/python/test_atomic_amdgpu.py: 14 end-to-end correctness tests covering every (type, op) combination that the AMDGPU atomic path emits: i32/i64/u32/u64 × {add, min, max, and, or, xor}, f32/f64 × {add, min, max}, i32/f32 × mul, plus the cross-kernel index-allocator pattern. The intra-kernel publish/subscribe pattern is marked with pytest.skip and points to the docs, so the contract is greppable.
  • tests/python/test_atomic_amdgpu_ir.py: 8 IR-level tripwires using the existing print_kernel_llvm_ir + subprocess pattern (mirroring test_fn_attrs.py). Each test runs one minimal kernel, dumps the LLVM IR, and asserts on the presence of syncscope("agent") monotonic and the absence of any seq_cst atomicrmw/cmpxchg. This catches regressions that would still produce correct results but reintroduce the cache-flush cost.
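
The kind of assertion the IR tripwires make can be sketched as a plain text scan. The actual tests are written in Python; the version below is a C++ illustration only, and the function name `atomics_are_agent_monotonic` is an assumption, not something taken from the test suite.

```cpp
#include <sstream>
#include <string>

// Hypothetical checker (illustrative name, not from the test suite).
// Returns true when the IR text contains at least one atomicrmw/cmpxchg
// line, every such line carries syncscope("agent") monotonic, and none of
// them mentions seq_cst.
inline bool atomics_are_agent_monotonic(const std::string &ir) {
  std::istringstream lines(ir);
  std::string line;
  bool saw_atomic = false;
  while (std::getline(lines, line)) {
    const bool is_atomic = line.find("atomicrmw") != std::string::npos ||
                           line.find("cmpxchg") != std::string::npos;
    if (!is_atomic) continue;  // only atomic instructions are inspected
    saw_atomic = true;
    if (line.find("seq_cst") != std::string::npos) return false;
    if (line.find("syncscope(\"agent\") monotonic") == std::string::npos) {
      return false;
    }
  }
  // An IR dump with no atomics at all would make the tripwire vacuous.
  return saw_atomic;
}
```

Requiring at least one atomic instruction is a design choice in this sketch: a kernel whose atomics were optimized away entirely would otherwise pass without testing anything.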

Base class refactor

Introduced two virtual hooks on TaskCodeGenLLVM:

codegen_llvm.h, lines 247-263

  // ...

  virtual llvm::AtomicOrdering default_atomic_ordering() const {
    return llvm::AtomicOrdering::SequentiallyConsistent;
  }

  virtual llvm::SyncScope::ID default_atomic_scope() const {
    return llvm::SyncScope::System;
  }

Threaded them through every atomic emission in TaskCodeGenLLVM::integral_type_atomic, TaskCodeGenLLVM::real_type_atomic, and TaskCodeGenLLVM::atomic_op_using_cas. Deleted the three verbatim AMDGPU copies and replaced them with two short overrides:

codegen_amdgpu.cpp, lines 179-184

  llvm::AtomicOrdering default_atomic_ordering() const override {
    return llvm::AtomicOrdering::Monotonic;
  }

  llvm::SyncScope::ID default_atomic_scope() const override {
    return amdgpu_agent_scope_;
  }

Cached the agent scope ID once in the AMDGPU constructor (replacing the per-emission getOrInsertSyncScopeID call). The CPU and CUDA codegens emit byte-for-byte identical IR to before.
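
The shape of this refactor can be shown without LLVM. In the sketch below, `Ordering`, `Scope`, `BaseCodegen`, `AmdgpuCodegen`, and `emit_atomic_add_i32` are all stand-ins invented for illustration; only the two hook names come from the PR. The shared emission path asks the hooks for its ordering and scope, so a backend changes both by overriding two small functions instead of copying the emission code.

```cpp
#include <string>

// Stand-in types so the sketch compiles without LLVM. The real code uses
// llvm::AtomicOrdering and llvm::SyncScope::ID.
enum class Ordering { Monotonic, SequentiallyConsistent };
enum class Scope { System, Agent };

struct BaseCodegen {
  virtual ~BaseCodegen() = default;

  // Defaults keep the previous behavior for every other backend.
  virtual Ordering default_atomic_ordering() const {
    return Ordering::SequentiallyConsistent;
  }
  virtual Scope default_atomic_scope() const { return Scope::System; }

  // Shared emission path (hypothetical): renders the instruction as text
  // using whatever the hooks return.
  std::string emit_atomic_add_i32() const {
    std::string ir = "atomicrmw add ptr %dest, i32 %val ";
    if (default_atomic_scope() == Scope::Agent) {
      ir += "syncscope(\"agent\") ";  // system scope is implicit in LLVM IR
    }
    ir += (default_atomic_ordering() == Ordering::Monotonic) ? "monotonic"
                                                             : "seq_cst";
    return ir + ", align 4";
  }
};

struct AmdgpuCodegen : BaseCodegen {
  Ordering default_atomic_ordering() const override {
    return Ordering::Monotonic;
  }
  Scope default_atomic_scope() const override { return Scope::Agent; }
};
```

With these stand-ins, `BaseCodegen{}.emit_atomic_add_i32()` yields the seq_cst form and `AmdgpuCodegen{}.emit_atomic_add_i32()` yields the `syncscope("agent") monotonic` form, matching the two instruction shapes quoted in the description.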

Correctness fixes

  • Mixed-scope atomics eliminated. Added a third virtual hook, prefer_cas_for_fp_minmax(), defaulting to false; AMDGPU overrides it to return true. In TaskCodeGenLLVM::real_type_atomic, before the runtime-helper fallback for f32/f64 min/max, we now route those ops through atomic_op_using_cas when the hook returns true. The CAS path uses default_atomic_ordering() / default_atomic_scope(), so on AMDGPU qd.atomic_min(f32_field, x) and qd.atomic_add(f32_field, y) both lower with syncscope("agent") monotonic, which removes the mixed-scope UB.

  • CAS-loop initial load is now atomic. TaskCodeGenLLVM::atomic_op_using_cas now uses load->setAtomic(default_atomic_ordering(), default_atomic_scope()) with an explicit natural alignment instead of a non-atomic load. The LLVM optimizer is no longer permitted to hoist that load out of the CAS retry block on any current or future LLVM version.
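
A host-side analogy for the CAS route, assuming C++ `std::atomic<float>` in place of the generated IR: the function `atomic_min_cas` below is hypothetical and is not the PR's atomic_op_using_cas. It shows the two properties described above, an atomic initial load and a compare-exchange retry loop, both with relaxed ordering. Unlike a generic CAS fallback, this sketch skips the exchange when the stored value is already small enough, and it does not handle NaN.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical CAS-based atomic minimum for float. Returns the value that
// was stored before this call took effect.
inline float atomic_min_cas(std::atomic<float> &dest, float val) {
  assert(val == val);  // NaN inputs are outside the scope of this sketch
  // Atomic (relaxed) initial load rather than a plain read of the location.
  float old_val = dest.load(std::memory_order_relaxed);
  // Retry until the stored value is already <= val or the swap succeeds.
  // On failure, compare_exchange_weak writes the current value into old_val,
  // so each retry compares against fresh data.
  while (val < old_val &&
         !dest.compare_exchange_weak(old_val, val,
                                     std::memory_order_relaxed,
                                     std::memory_order_relaxed)) {
  }
  return old_val;
}
```

For example, with `std::atomic<float> x{5.0f}`, calling `atomic_min_cas(x, 3.0f)` returns 5.0f and leaves 3.0f stored; a following `atomic_min_cas(x, 7.0f)` returns 3.0f and changes nothing.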

@rtmadduri


/run-ci


rtmadduri changed the title from "implement atomic optimizations" to "Relaxed atomic ordering and AMDGPU-scoped atomics" on May 15, 2026