Relaxed atomic ordering and AMDGPU-scoped atomics by rtmadduri · Pull Request #38 · ROCm/quadrants

rtmadduri · 2026-05-12T04:13:13Z

Description

The PR rewrites every AMDGPU atomic emission so it lowers as:
atomicrmw <op> ptr %dest, <ty> %val syncscope("agent") monotonic, align N

instead of the C++-default

atomicrmw <op> ptr %dest, <ty> %val seq_cst, align N ; implicit system scope

On gfx942 (MI300), that change skips the s_waitcnt vmcnt(0) lgkmcnt(0) + buffer_inv / buffer_wbinvl1_vol cache-invalidate pair that sequentially consistent, system-scope atomics emit around every global_atomic_*.
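In C++ terms, LLVM's monotonic ordering corresponds to std::memory_order_relaxed, while the default std::atomic operations are seq_cst. The sketch below is a host-side analogue only, not code from this PR: allocate_slot and its parameters are hypothetical names, and standard C++ has no equivalent of syncscope("agent"). It shows why a relaxed read-modify-write is enough for an index allocator: the update is still indivisible, so each caller receives a distinct index.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side analogue of a GPU index allocator; not PR code.
// Claims the next free slot and returns its index.
inline std::uint32_t allocate_slot(std::atomic<std::uint32_t> &next_free,
                                   std::uint32_t capacity) {
  // memory_order_relaxed is the C++ counterpart of LLVM's `monotonic`:
  // the fetch_add is atomic, so no two callers can get the same index,
  // but it imposes no ordering on surrounding memory accesses.
  const std::uint32_t index =
      next_free.fetch_add(1, std::memory_order_relaxed);
  assert(index < capacity && "allocator overflow");
  return index;
}

// Same operation with the C++ default ordering, for comparison.
inline std::uint32_t allocate_slot_seq_cst(
    std::atomic<std::uint32_t> &next_free) {
  return next_free.fetch_add(1);  // defaults to std::memory_order_seq_cst
}
```

Relaxed ordering would not be enough if one thread had to publish data that another thread reads in the same pass; that needs release/acquire ordering or stronger.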

File: quadrants/codegen/amdgpu/codegen_amdgpu.cpp

Summary

Added private helper llvm::SyncScope::ID amdgpu_atomic_scope() that returns the agent syncscope ID (device-scope, not system-scope).
Overrode optimized_reduction, integral_type_atomic, real_type_atomic, and atomic_op_using_cas from the base LLVM codegen.
Changed every CreateAtomicRMW and CreateAtomicCmpXchg call from AtomicOrdering::SequentiallyConsistent + default (system) syncscope to AtomicOrdering::Monotonic + agent syncscope.

Rationale

The base LLVM codegen unconditionally emitted the strongest possible atomic primitives: sequentially consistent ordering (full memory fence semantics) at system scope (visible to the host CPU). On AMDGPU this lowers to L2/HBM cache flushes and cross-CU broadcasts on every atomic, even though the affected kernels only need device-local visibility (no CPU readers, no cross-stream synchronization in the inner loop).

Two key facts about Quadrants atomics:

The optimized_reduction path only handles atomics flagged is_reduction=true in AtomicOpStmt. Tracing the IR builder revealed that the dominant atomic-heavy kernels (narrowphase contact, broadphase, inequality constraint kernels) construct their atomics through the general atomic path (integral_type_atomic / real_type_atomic), not the reduction path. Overriding only optimized_reduction would have missed them entirely.

The atomic_op_using_cas helper is the fallback for atomic ops with no native instruction; on AMDGPU this is hit for f64 atomics and several integer width combinations. It also needed the relaxed ordering.

Changes

Test scaffolding

tests/python/test_atomic_amdgpu.py contains 14 end-to-end correctness tests covering every (type, op) combination that the AMDGPU atomic path emits: i32/i64/u32/u64 × {add, min, max, and, or, xor}, f32/f64 × {add, min, max}, i32/f32 × mul, plus the cross-kernel index-allocator pattern. The intra-kernel publish/subscribe pattern is pytest.skip-marked with a pointer to the docs so the contract is greppable.
tests/python/test_atomic_amdgpu_ir.py contains 8 IR-level tripwires using the existing print_kernel_llvm_ir + subprocess pattern (mirroring test_fn_attrs.py). Each test runs one minimal kernel, dumps the LLVM IR, and asserts on the presence of syncscope("agent") monotonic and the absence of any seq_cst atomicrmw/cmpxchg. This catches regressions that would still produce correct results but reintroduce the cache-flush cost.
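The assertion logic of the IR tripwires can be sketched as follows. The real tests are Python; this C++ version is only an illustration, and atomics_are_agent_monotonic is a hypothetical helper, not part of the PR. It assumes the IR dump has one instruction per line.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper illustrating the tripwire check; not PR code.
// Returns true when the IR contains at least one atomicrmw/cmpxchg and
// every such line uses syncscope("agent") monotonic with no seq_cst.
inline bool atomics_are_agent_monotonic(const std::string &ir) {
  std::istringstream lines(ir);
  std::string line;
  bool saw_atomic = false;
  while (std::getline(lines, line)) {
    const bool is_atomic = line.find("atomicrmw") != std::string::npos ||
                           line.find("cmpxchg") != std::string::npos;
    if (!is_atomic) continue;
    saw_atomic = true;
    // A seq_cst atomic would still compute the right answer, so only an
    // IR-level check can detect it.
    if (line.find("seq_cst") != std::string::npos) return false;
    if (line.find("syncscope(\"agent\") monotonic") == std::string::npos)
      return false;
  }
  // An IR dump with no atomics at all means the kernel did not exercise
  // the path under test, so treat it as a failure too.
  return saw_atomic;
}
```

Requiring at least one atomic guards against a test that passes vacuously because the kernel was optimized away.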

Base class refactor

Introduced two virtual hooks on TaskCodeGenLLVM:

codegen_llvm.h, lines 247-263

  // ...

  virtual llvm::AtomicOrdering default_atomic_ordering() const {
    return llvm::AtomicOrdering::SequentiallyConsistent;
  }

  virtual llvm::SyncScope::ID default_atomic_scope() const {
    return llvm::SyncScope::System;
  }

Threaded them through every atomic emission in TaskCodeGenLLVM::integral_type_atomic, TaskCodeGenLLVM::real_type_atomic, and TaskCodeGenLLVM::atomic_op_using_cas. Deleted the three verbatim AMDGPU copies and replaced them with two-line overrides:

codegen_amdgpu.cpp, lines 179-184

  llvm::AtomicOrdering default_atomic_ordering() const override {
    return llvm::AtomicOrdering::Monotonic;
  }

  llvm::SyncScope::ID default_atomic_scope() const override {
    return amdgpu_agent_scope_;
  }

Cached the agent scope ID once in the AMDGPU constructor (replacing the per-emission getOrInsertSyncScopeID call). CPU and CUDA codegens are byte-for-byte IR-identical to before.
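The hook pattern can be sketched without LLVM as follows. Everything here is a stand-in chosen for illustration: Ordering, ScopeId, kSystemScope, BaseCodeGen, AmdgpuCodeGen, and emit_atomic_add are hypothetical names, and the real hooks return llvm::AtomicOrdering and llvm::SyncScope::ID. The point is that the shared emission path queries the hooks, so a backend changes ordering and scope without copying the emission code.

```cpp
#include <cassert>
#include <string>

// Stand-ins for the LLVM types so the sketch compiles on its own.
enum class Ordering { Monotonic, SequentiallyConsistent };
using ScopeId = unsigned;
constexpr ScopeId kSystemScope = 1;

class BaseCodeGen {
 public:
  virtual ~BaseCodeGen() = default;

  // Shared emission path: asks the hooks instead of hard-coding
  // seq_cst at system scope. Simplification: any non-system scope is
  // printed as "agent".
  std::string emit_atomic_add() const {
    std::string ir = "atomicrmw add ptr %dest, i32 %val ";
    if (default_atomic_scope() != kSystemScope) ir += "syncscope(\"agent\") ";
    ir += default_atomic_ordering() == Ordering::Monotonic ? "monotonic"
                                                           : "seq_cst";
    return ir;
  }

 protected:
  virtual Ordering default_atomic_ordering() const {
    return Ordering::SequentiallyConsistent;
  }
  virtual ScopeId default_atomic_scope() const { return kSystemScope; }
};

class AmdgpuCodeGen : public BaseCodeGen {
 public:
  // The scope ID is resolved once by the caller and cached here, rather
  // than being looked up on every emission.
  explicit AmdgpuCodeGen(ScopeId agent_scope) : agent_scope_(agent_scope) {
    assert(agent_scope != kSystemScope);
  }

 protected:
  Ordering default_atomic_ordering() const override {
    return Ordering::Monotonic;
  }
  ScopeId default_atomic_scope() const override { return agent_scope_; }

 private:
  ScopeId agent_scope_;
};
```

Because the base class keeps the old defaults, backends that do not override the hooks produce the same output as before.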

Correctness fixes

Mixed scope eliminated. Added a third virtual hook, prefer_cas_for_fp_minmax(), which defaults to false and which AMDGPU overrides to return true. In TaskCodeGenLLVM::real_type_atomic, f32/f64 min/max ops are now routed through atomic_op_using_cas when the hook returns true, before the runtime-helper fallback is reached. The CAS path uses default_atomic_ordering() / default_atomic_scope(), so on AMDGPU qd.atomic_min(f32_field, x) and qd.atomic_add(f32_field, y) both lower with syncscope("agent") monotonic, which removes the mixed-scope UB.
CAS-loop initial load is now atomic. TaskCodeGenLLVM::atomic_op_using_cas now uses load->setAtomic(default_atomic_ordering(), default_atomic_scope()) with an explicit natural alignment instead of a non-atomic load. The LLVM optimizer is no longer permitted to hoist that load out of the CAS retry block on any current or future LLVM version.
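A host-side analogue of a CAS-based floating-point min, with an atomic initial load, might look like the sketch below. It uses std::atomic rather than LLVM IR builders, atomic_min_using_cas is a hypothetical function that only borrows its name from the helper described above, and NaN handling is deliberately left out.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cmath>

// Hypothetical host-side analogue of a CAS-based atomic min; not PR code.
// Stores min(dest, val) into dest and returns the previous value.
inline float atomic_min_using_cas(std::atomic<float> &dest, float val) {
  assert(!std::isnan(val) && "NaN handling is out of scope for this sketch");
  // The initial read is an atomic relaxed load, not a plain load, so the
  // compiler must treat it as an access to shared memory.
  float old_val = dest.load(std::memory_order_relaxed);
  // Retry until the exchange succeeds. On failure, compare_exchange_weak
  // refreshes old_val with the value currently stored in dest, so the
  // minimum is recomputed from fresh data on each attempt.
  while (!dest.compare_exchange_weak(old_val, std::min(old_val, val),
                                     std::memory_order_relaxed,
                                     std::memory_order_relaxed)) {
  }
  return old_val;
}
```

Both the success and failure orderings are relaxed here, matching the idea that the CAS path should use the same ordering as the native atomics instead of mixing strengths.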

rtmadduri · 2026-05-14T00:32:35Z

/run-ci

rtmadduri · 2026-05-14T00:32:43Z

/run-ci

rtmadduri · 2026-05-14T18:39:17Z

/run-ci

rtmadduri · 2026-05-14T21:01:59Z

/run-ci

implement atomic optimizations

9b568f7

remove amdgpu-flat-work-group-size changes

f4886cd

rtmadduri added 2 commits May 15, 2026 14:43

fix the security issues

fd0df6d

fix a broken codegen_amd

6546712

rtmadduri changed the title ~~implement atomic optimizations~~ Relaxed atomic ordering and AMDGPU-scoped atomics May 15, 2026

rtmadduri wants to merge 4 commits into amd-integration from perf/rtmadduri/atomic-optim