Fixes to enable using latest ROCm for all quadrants compilation needs by doru1004 · Pull Request #2 · ROCm/quadrants

doru1004 · 2026-04-18T05:27:23Z

This patch enables the latest ROCm build to be used for all the compilation needs of Quadrants including for JIT-ing.

To make this happen export the following paths:

export ROCM_PATH=/path/to/rocm
export LLVM_DIR=/path/to/rocm

yaoliu13

This PR has conflicts if we want to merge it to amd-integration. Please read Confluence page "Genesis PR Review Process".

doru1004 · 2026-05-07T16:42:34Z

Closing this for now. The fix to enable ROCM everywhere should be done differently: by removing the dependency on RTTI being enabled for the JIT compiler.

…eline) Closes the remaining ~12pp post-merge throughput gap on the Genesis G1 8192-env benchmark by removing the only AMDGPU pipeline change that the LLVM-22 upgrade actually destabilized. Diagnosis. After committing #1 (RuntimeContext alignment) and #2 (range_for body / SNode child accessor inlining), the benchmark sat at 1.57M env·steps/s vs. a 1.79M pre-merge baseline. A symbol-level diff between the post-fix wheel and the last-known-good `wheels_align_inline_fix` wheel (which was actually built before sccache served the current jit_amdgpu.cpp.o) showed the pre-merge wheel was missing every `AMDGPUFlatToGlobalLoadStorePass` symbol — i.e. it had never run that pass, full stop. The wheel built from current source crashed inside `qd.init(arch=qd.amdgpu)` with EXIT=139 in `compile_module_to_hsaco` running the runtime bitcode. Root cause. The pass's `originatesFromScratch` walks `Argument` values back through `Function::users()` to inspect each direct caller, recursing into the corresponding `CallBase::getArgOperand(ArgNo)`. Each crossing of a caller boundary resets the visited-set (line 165, `CallerVisited`), so cycles through the `runtime_*` / `LLVMRuntime_*` call graph aren't broken. The pre-LLVM-22 runtime bitcode happened to keep the recursion shallow; post-LLVM-22 (with kernarg-by-value RuntimeContext, more internal helper functions, and a different inlining shape) the same pass blows the stack on init. Disabling the pass entirely is not an option — without it Genesis's solver kernels emit wrong addresses on the constraint-force path and the simulation poisons with NaN. Fix. Skip `AMDGPUFlatToGlobalLoadStorePass` only on the runtime bitcode module — detected by the unique presence of `runtime_initialize` — and keep running it on every user-kernel module. The runtime functions are inlined into user kernels at user-kernel JIT time, so all their loads/stores still get the flat→global lowering, just in the right module shape (the user kernel is well-formed; the runtime BC is a graph of mutually-recursive helpers). Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters, 500 steps, MI300X, FP32): pre-merge baseline: 1,790,770 env·steps/s post-merge, no fixes: 1,386,055 env·steps/s (-22.6%) fixes #1+#2 only: 1,569,491 env·steps/s (-12.4%) fixes #1+#2 + this commit: 1,791,235 env·steps/s (+0.03%) Recovers the full 22.6pp regression. The previously-suspected "kernel function attributes (uniform-work-group-size, waves-per-eu, fast-math) flipped under LLVM 22" follow-up is no longer a gap to close — `compile_module_to_hsaco` already reapplies those defaults to AMDGPU_KERNEL functions (lines 75-86), and post this commit the benchmark sits on the baseline regardless. Leaving them as-is. Co-authored-by: Cursor <cursoragent@cursor.com>

doru1004 added 2 commits April 18, 2026 00:23

Fixes to enable using latest ROCm for all quadrants compilation needs

2d468c9

Set vector interleave to 8

ea3e83c

gpinkert mentioned this pull request Apr 23, 2026

Perf/async hip memcpy l1 #9

Closed

yaoliu13 requested changes May 3, 2026

View reviewed changes

doru1004 closed this May 7, 2026

lohiaj mentioned this pull request May 11, 2026

perf(amdgpu): skip arg-buffer H2D when host bytes match prior launch #28

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fixes to enable using latest ROCm for all quadrants compilation needs#2

Fixes to enable using latest ROCm for all quadrants compilation needs#2
doru1004 wants to merge 2 commits into
amd-integrationfrom
amd-integration-fixes-for-rocm

doru1004 commented Apr 18, 2026

Uh oh!

yaoliu13 left a comment

Uh oh!

doru1004 commented May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

doru1004 commented Apr 18, 2026

Uh oh!

yaoliu13 left a comment

Choose a reason for hiding this comment

Uh oh!

doru1004 commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

doru1004 commented May 7, 2026 •

edited

Loading