[Executorch] Use temp allocator for allocating scratch memory by kimishpatel · Pull Request #15728 · pytorch/executorch

kimishpatel · 2025-11-11T04:34:41Z

Stack from ghstack (oldest at bottom):

This lets us leverage the temp memory allocator; if that allocator is a caching allocator, it reduces allocation overhead.
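To illustrate why a caching allocator can reduce allocation overhead, here is a minimal sketch. `ToyCachingAllocator` and all of its members are hypothetical names invented for this example; it is not the allocator used by ExecuTorch, only an illustration of the general idea of reusing previously allocated blocks instead of going back to the system allocator on every call.

```cpp
#include <cstddef>
#include <cstdlib>
#include <unordered_map>
#include <vector>

// Hypothetical toy allocator (not an ExecuTorch class): blocks handed out
// during one "inference" are returned to a per-size cache on reset() and
// reused by later allocate() calls of the same size.
class ToyCachingAllocator {
 public:
  ToyCachingAllocator() = default;
  ToyCachingAllocator(const ToyCachingAllocator&) = delete;
  ToyCachingAllocator& operator=(const ToyCachingAllocator&) = delete;

  ~ToyCachingAllocator() {
    for (auto& entry : free_blocks_) {
      for (void* block : entry.second) std::free(block);
    }
    for (auto& entry : in_use_) std::free(entry.first);
  }

  void* allocate(std::size_t size) {
    if (size == 0) size = 1;  // avoid implementation-defined malloc(0)
    auto it = free_blocks_.find(size);
    if (it != free_blocks_.end() && !it->second.empty()) {
      // Cache hit: reuse a block of exactly this size, no system call.
      void* block = it->second.back();
      it->second.pop_back();
      in_use_[block] = size;
      ++cache_hits_;
      return block;
    }
    // Cache miss: fall through to the system allocator.
    void* block = std::malloc(size);
    if (block == nullptr) return nullptr;
    in_use_[block] = size;
    ++system_allocs_;
    return block;
  }

  // Return every in-use block to the cache (e.g. after an operator finishes).
  void reset() {
    for (auto& entry : in_use_) {
      free_blocks_[entry.second].push_back(entry.first);
    }
    in_use_.clear();
  }

  std::size_t system_allocs() const { return system_allocs_; }
  std::size_t cache_hits() const { return cache_hits_; }

 private:
  std::unordered_map<std::size_t, std::vector<void*>> free_blocks_;
  std::unordered_map<void*, std::size_t> in_use_;
  std::size_t system_allocs_ = 0;
  std::size_t cache_hits_ = 0;
};
```

In this sketch, a kernel that requests the same scratch size on every invocation reaches the system allocator only the first time; subsequent requests after `reset()` are served from the cache.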

Differential Revision: D85532076

pytorch-bot · 2025-11-11T04:34:45Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15728

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 2 Unrelated Failures

As of commit af048bd with merge base a09a4b7:

NEW FAILURES - The following jobs have failed:

pull / test-static-llama-qnn-linux (stories_260k_bc) / linux-job (gh)
RuntimeError: Command docker exec -t bec67b6cc57065a05e31127c12446355e22c85aa8bb286051abb2a5ac029593d /exec failed with exit code 1
pull / unittest-arm-backend-with-no-fvp (test_pytest_models) / linux-job (gh)
RuntimeError: Command docker exec -t d42e42c94b4a15ef37ea4ca6cd1d312906b499d1e6458dd5d245612ccf22806a /exec failed with exit code 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

periodic / test-models-linux (cmake, vit, xnnpack-quantization-delegation, linux.2xlarge, 90) / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / android / run-emulator (gh) (trunk failure)
Timeout waiting for emulator to boot.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2025-11-11T04:36:09Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot

Pull Request Overview

This PR refactors memory allocation in the Flash Attention implementation to use the temporary memory allocator from RuntimeContext instead of stack-allocated std::vector objects. This enables the use of caching allocators when available, reducing allocation overhead.

- Adds RuntimeContext& ctx parameter to cpu_flash_attention function
- Replaces stack-allocated vectors with ctx.allocate_temp() calls with fallback to heap allocation
- Removes unnecessary buf_reduced allocation (dead code for unsupported reduced types)
- Updates all call sites to pass the RuntimeContext
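The "temp allocator with heap fallback" pattern described above can be sketched as follows. This is a simplified illustration, not the code from this PR: `ToyContext`, `Scratch`, and `get_scratch` are hypothetical names, and the toy `allocate_temp` here simply returns `nullptr` on failure, which may differ from the signature and return type of the real `RuntimeContext` method.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a runtime context (NOT the ExecuTorch API).
// It serves temp memory from a fixed-size arena and returns nullptr when
// the request does not fit.
class ToyContext {
 public:
  explicit ToyContext(std::size_t capacity)
      : arena_(new unsigned char[capacity]), capacity_(capacity) {}

  void* allocate_temp(
      std::size_t size,
      std::size_t alignment = alignof(std::max_align_t)) {
    // Alignment must be a non-zero power of two.
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(arena_.get());
    const std::uintptr_t cur = base + offset_;
    const std::uintptr_t aligned =
        (cur + alignment - 1) & ~static_cast<std::uintptr_t>(alignment - 1);
    const std::size_t new_offset = (aligned - base) + size;
    if (new_offset > capacity_) return nullptr;  // arena exhausted
    offset_ = new_offset;
    return reinterpret_cast<void*>(aligned);
  }

 private:
  std::unique_ptr<unsigned char[]> arena_;
  std::size_t capacity_;
  std::size_t offset_ = 0;
};

// Scratch memory that prefers the temp allocator and owns a heap buffer
// only when the temp allocator could not satisfy the request.
struct Scratch {
  void* data = nullptr;
  std::unique_ptr<unsigned char[]> heap_fallback;
  bool from_temp_allocator() const { return heap_fallback == nullptr; }
};

Scratch get_scratch(ToyContext& ctx, std::size_t size) {
  Scratch s;
  s.data = ctx.allocate_temp(size);
  if (s.data == nullptr) {
    // Fallback path: allocate on the heap; freed when Scratch is destroyed.
    s.heap_fallback.reset(new unsigned char[size]);
    s.data = s.heap_fallback.get();
  }
  return s;
}
```

Keeping the fallback buffer in a `std::unique_ptr` means the caller does not need to track which path was taken: temp memory is reclaimed by whoever owns the context, and heap memory is released when `Scratch` goes out of scope.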

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `extension/llm/custom_ops/op_sdpa_impl.h` | Modified `cpu_flash_attention` signature to accept `RuntimeContext`, replaced vector allocations with temp allocator calls |
| `extension/llm/custom_ops/op_sdpa.cpp` | Updated all call sites (6 locations) to pass `ctx` parameter to `cpu_flash_attention` |

Copilot · 2025-11-17T16:18:29Z

extension/llm/custom_ops/op_sdpa_impl.h


 namespace sdpa::impl {

+static std::vector<char> scratch_for_quant_dequant_vec;


This static vector scratch_for_quant_dequant_vec is declared but never used in the code. It appears to be a leftover from the refactoring where the local vector was replaced with the temp allocator approach. This should be removed.

Suggested change

static std::vector<char> scratch_for_quant_dequant_vec;

Pull Request resolved: #15728 This lets us leverage the temp memory allocator; if that allocator is a caching allocator, it reduces allocation overhead. ghstack-source-id: 325470498 @exported-using-ghexport Differential Revision: [D85532076](https://our.internmc.facebook.com/intern/diff/D85532076/)

Pull Request resolved: #15728 This lets us leverage the temp memory allocator; if that allocator is a caching allocator, it reduces allocation overhead. ghstack-source-id: 325471916 @exported-using-ghexport Differential Revision: [D85532076](https://our.internmc.facebook.com/intern/diff/D85532076/)

Pull Request resolved: #15728 This lets us leverage the temp memory allocator; if that allocator is a caching allocator, it reduces allocation overhead. ghstack-source-id: 325655867 @exported-using-ghexport Differential Revision: [D85532076](https://our.internmc.facebook.com/intern/diff/D85532076/)

Pull Request resolved: #15728 This lets us leverage the temp memory allocator; if that allocator is a caching allocator, it reduces allocation overhead. ghstack-source-id: 327518050 @exported-using-ghexport Differential Revision: [D85532076](https://our.internmc.facebook.com/intern/diff/D85532076/)

@kimishpatel

This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #15728 by @kimishpatel
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/kimishpatel/211/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/211/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/gh/kimishpatel/203/orig
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/211/orig
Differential Revision: [D85532076](https://our.internmc.facebook.com/intern/diff/D85532076/)
@diff-train-skip-merge
Co-authored-by: Kimish Patel <kimishpatel@fb.com>

kimishpatel requested review from jackzhxng, larryliu0820 and mergennachin as code owners November 11, 2025 04:34

meta-cla bot added the CLA Signed label Nov 11, 2025

meta-codesync bot added fb-exported meta-exported labels Nov 11, 2025

kimishpatel mentioned this pull request Nov 14, 2025

[Executorch] make slice_copy parallel #15830

Merged

mergennachin requested a review from Copilot November 17, 2025 16:15

Copilot started reviewing on behalf of mergennachin November 17, 2025 16:15

Copilot finished reviewing on behalf of mergennachin November 17, 2025 16:18

Copilot AI reviewed Nov 17, 2025

kimishpatel added 7 commits November 20, 2025 09:24

kimishpatel added 2 commits November 24, 2025 10:04

kimishpatel added 4 commits November 25, 2025 14:17

This was referenced Dec 4, 2025

[Cria][Lllama runner] Use caching temp allocator #16080

Open

[Cria][Lllama runner] Use caching temp allocator #16081

Open

kimishpatel added 10 commits December 4, 2025 08:34

meta-codesync bot merged commit 765bc5e into gh/kimishpatel/211/base Dec 6, 2025
163 of 167 checks passed

meta-codesync bot deleted the gh/kimishpatel/211/head branch December 6, 2025 07:20

meta-codesync bot temporarily deployed to cherry-pick-bot December 6, 2025 07:20 Inactive

pytorchbot mentioned this pull request Dec 6, 2025

[Executorch] Use temp allocator for allocating scratch memory #16121

Merged

meta-codesync[bot] merged 28 commits into gh/kimishpatel/211/base from gh/kimishpatel/211/head
