Skip to content

Fixes to enable using latest ROCm for all quadrants compilation needs#2

Closed
doru1004 wants to merge 2 commits into
amd-integrationfrom
amd-integration-fixes-for-rocm
Closed

Fixes to enable using latest ROCm for all quadrants compilation needs#2
doru1004 wants to merge 2 commits into
amd-integrationfrom
amd-integration-fixes-for-rocm

Conversation

@doru1004

Copy link
Copy Markdown

This patch enables the latest ROCm build to be used for all the compilation needs of Quadrants including for JIT-ing.

To make this happen export the following paths:

export ROCM_PATH=/path/to/rocm
export LLVM_DIR=/path/to/rocm

@gpinkert gpinkert mentioned this pull request Apr 23, 2026

@yaoliu13 yaoliu13 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has conflicts if we want to merge it to amd-integration. Please read Confluence page "Genesis PR Review Process".

@doru1004

doru1004 commented May 7, 2026

Copy link
Copy Markdown
Author

Closing this for now. The fix to enable ROCM everywhere should be done differently: by removing the dependency on RTTI being enabled for the JIT compiler.

@doru1004 doru1004 closed this May 7, 2026
npoulad1 added a commit that referenced this pull request May 9, 2026
…eline)

Closes the remaining ~12pp post-merge throughput gap on the Genesis G1
8192-env benchmark by removing the only AMDGPU pipeline change that the
LLVM-22 upgrade actually destabilized.

Diagnosis. After committing #1 (RuntimeContext alignment) and #2
(range_for body / SNode child accessor inlining), the benchmark sat at
1.57M env·steps/s vs. a 1.79M pre-merge baseline. A symbol-level diff
between the post-fix wheel and the last-known-good `wheels_align_inline_fix`
wheel (which was actually built before sccache served the current
jit_amdgpu.cpp.o) showed the pre-merge wheel was missing every
`AMDGPUFlatToGlobalLoadStorePass` symbol — i.e. it had never run that
pass, full stop. The wheel built from current source crashed inside
`qd.init(arch=qd.amdgpu)` with EXIT=139 in
`compile_module_to_hsaco` running the runtime bitcode.

Root cause. The pass's `originatesFromScratch` walks `Argument` values
back through `Function::users()` to inspect each direct caller, recursing
into the corresponding `CallBase::getArgOperand(ArgNo)`. Each crossing
of a caller boundary resets the visited-set (line 165, `CallerVisited`),
so cycles through the `runtime_*` / `LLVMRuntime_*` call graph aren't
broken. The pre-LLVM-22 runtime bitcode happened to keep the recursion
shallow; post-LLVM-22 (with kernarg-by-value RuntimeContext, more
internal helper functions, and a different inlining shape) the same
pass blows the stack on init. Disabling the pass entirely is not an
option — without it Genesis's solver kernels emit wrong addresses on
the constraint-force path and the simulation poisons with NaN.

Fix. Skip `AMDGPUFlatToGlobalLoadStorePass` only on the runtime
bitcode module — detected by the unique presence of `runtime_initialize`
— and keep running it on every user-kernel module. The runtime
functions are inlined into user kernels at user-kernel JIT time, so
all their loads/stores still get the flat→global lowering, just in
the right module shape (the user kernel is well-formed; the runtime
BC is a graph of mutually-recursive helpers).

Validated end-to-end on Genesis G1 8192-env benchmark (CG, 15 iters,
500 steps, MI300X, FP32):

  pre-merge baseline:           1,790,770 env·steps/s
  post-merge, no fixes:         1,386,055 env·steps/s   (-22.6%)
  fixes #1+#2 only:             1,569,491 env·steps/s   (-12.4%)
  fixes #1+#2 + this commit:    1,791,235 env·steps/s   (+0.03%)

Recovers the full 22.6pp regression. The previously-suspected
"kernel function attributes (uniform-work-group-size, waves-per-eu,
fast-math) flipped under LLVM 22" follow-up is no longer a gap to
close — `compile_module_to_hsaco` already reapplies those defaults to
AMDGPU_KERNEL functions (lines 75-86), and post this commit the
benchmark sits on the baseline regardless. Leaving them as-is.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants