perf(amdgpu): amdgpu perf + force_inline #15

Merged
yaoliu13 merged 4 commits into amd-integration from perf/deepsek/expose_perf_vars
May 3, 2026

Conversation

@deepsek

@deepsek deepsek commented Apr 25, 2026

Collaborator
  • kernel launch improvements
  • force_inline support added to the AST

@deepsek deepsek force-pushed the perf/deepsek/expose_perf_vars branch from 7c8ba7e to 789d90d Compare April 26, 2026 15:00
@deepsek deepsek force-pushed the perf/deepsek/expose_perf_vars branch from 789d90d to 95b5708 Compare April 28, 2026 09:27
@deepsek

deepsek commented Apr 28, 2026

Collaborator Author

/run-ci

@yaoliu13 yaoliu13 left a comment

Collaborator

Thank you

@yaoliu13

Collaborator

/run-ci

1 similar comment
@yaoliu13

Collaborator

/run-ci

@lohiaj

lohiaj commented Apr 29, 2026


Strong evidence in favor of landing this. AMDGCN dumps on gfx942 from the Genesis hot kernels show that the launcher kernels are thin (≤74 VGPRs, 0 scratch), but each one calls an outlined function_body via s_swappc_b64 that pays a fixed callee-save prologue/epilogue:

| Outlined function_body in | VGPR | AGPR | Callee-save scratch |
| --- | --- | --- | --- |
| func_solve_body_monolith_kernel_1 | 256 | 172 | 1012 B (252 dwords) |
| func_solve_init_kernel_11 | 248 | 32 | 284 B |
| func_solve_init_kernel_7 | 248 | 32 | 292 B |
| kernel_step_1_kernel_26 | 248 | 32 | 240 B |

The prologue in each is ~60 contiguous scratch_store_dword instructions plus ~32 v_accvgpr_write_b32 instructions, saving v40..v207 in 8-VGPR groups; the epilogue mirrors it. That comes to ~184 prologue/epilogue ops on every call.
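
For anyone who wants to sanity-check the ~184 figure, here is a minimal sketch of the arithmetic. It only uses the approximate counts quoted above; the constant and function names are hypothetical and are not taken from the dumps or from the PR.

```python
# Approximate prologue counts quoted in the comment above (both are "~" values).
SCRATCH_STORES = 60   # ~60 scratch_store_dword instructions
ACCVGPR_WRITES = 32   # ~32 v_accvgpr_write_b32 instructions


def callee_save_ops(scratch_stores: int, accvgpr_writes: int) -> int:
    """Return prologue ops plus the mirrored epilogue ops for one call."""
    prologue = scratch_stores + accvgpr_writes
    epilogue = prologue  # assumption from the comment: the epilogue mirrors the prologue
    return prologue + epilogue


ops_per_call = callee_save_ops(SCRATCH_STORES, ACCVGPR_WRITES)

# v40..v207 inclusive is the callee-saved VGPR range mentioned above.
saved_vgprs = 207 - 40 + 1

print(f"~{ops_per_call} save/restore ops per call, {saved_vgprs} VGPRs in v40..v207")
```

Because this cost is paid on every call to the outlined function_body, it scales with the number of calls rather than with the size of the body.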

Paired AMDGCN dumps confirm that source-level changes don't clear this floor: for example, a loop-fusion candidate that cut 690 asm lines from function_body left ΔVGPR/ΔAGPR/Δscratch = 0. However much the body shrinks, the same registers are still saved across the call boundary.

force_inline on the relevant @qd.funcs removes that boundary entirely. Happy to share the full dumps for any of the above kernels if useful.

@yaoliu13 @deepsek FYI

@deepsek

deepsek commented Apr 29, 2026

Collaborator Author

Good eye @lohiaj! That's the main reason I'm exposing this as a variable. Thanks for validating it too!
A fun little exercise would also be to see why I'm exposing it as a loop_config decorator instead of qd.func.
In any case, I'm waiting for this to finally land so that the other side of the spectrum can see the light of day!

@deepsek

deepsek commented Apr 29, 2026

Collaborator Author

/run-ci

@deepsek deepsek force-pushed the perf/deepsek/expose_perf_vars branch from 9d2a6cb to caf2c1f Compare May 1, 2026 20:20
@deepsek

deepsek commented May 1, 2026

Collaborator Author

/run-ci

@yaoliu13

yaoliu13 commented May 2, 2026

Collaborator

1370057 and 4968

@ROCm ROCm deleted a comment from mukh1l May 2, 2026
@ROCm ROCm deleted a comment from deepsek May 2, 2026

@yaoliu13 yaoliu13 left a comment

Collaborator

LGTM

@yaoliu13

yaoliu13 commented May 2, 2026

Collaborator

Need one more approval

@lohiaj lohiaj left a comment

Reviewed and approved. force_inline removes the outlined callee save/restore boundary that I validated on the Genesis hot kernels, and the launcher hot-path cleanup looks clean.

@yaoliu13

yaoliu13 commented May 3, 2026

Collaborator

/run-ci

1 similar comment
@yaoliu13

yaoliu13 commented May 3, 2026

Collaborator

/run-ci

@yaoliu13

yaoliu13 commented May 3, 2026

Collaborator

1370243.3 and 4956

@yaoliu13 yaoliu13 left a comment

Collaborator

LGTM

@yaoliu13 yaoliu13 merged commit e9a5464 into amd-integration May 3, 2026
39 of 47 checks passed
@deepsek deepsek deleted the perf/deepsek/expose_perf_vars branch May 4, 2026 18:19

4 participants