Reduce allocation overhead in quantized sdpa #15610
meta-codesync[bot] merged 22 commits into gh/kimishpatel/202/base
Conversation
For small models, dequantizing portions of the v cache causes extra allocation overhead. A better way to handle this would probably be to dequantize the entire v cache outside the model. There isn't a significant perf advantage from this yet, but subsequent diffs will use a caching allocator, where this refactor helps. Differential Revision: [D85532077](https://our.internmc.facebook.com/intern/diff/D85532077/) [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15610
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Unrelated Failure as of commit 4e529d1 with merge base 8af8252.
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull Request Overview
This PR refactors the quantized scaled dot-product attention (SDPA) implementation to reduce allocation overhead by moving the dequantization buffer allocation from inside the dequant_and_gemm function to the outer cpu_flash_attention scope. Instead of allocating a new std::vector for each dequantization operation, a pre-allocated per-thread scratch buffer is now shared across iterations.
Key changes:
- Added a `buf_qdq_ptr` parameter to the `dequant_and_gemm` and `_qk_at_v_gemm` functions to accept externally allocated dequantization buffers
- Allocated a shared scratch buffer (`scratch_for_quant_dequant`) in `cpu_flash_attention` with per-thread partitioning
- Removed the local `std::vector<float> dequantized_v_data` allocation from `dequant_and_gemm`
```cpp
// {num_thread, qSplitSize, is_reduced_type ? kvSplitSize : 0},
// query.options());
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;
// Lets align size_per_thread_qdq_vec to 64 bytes, for coalesced cache reads,
```
The comment says "align to 64 bytes" but `kAlignment = 32` aligns to 32 elements. Since `size_per_thread_qdq_vec` is an element count (not a byte count), and assuming `accum_t` is `float` (4 bytes), this aligns to 128 bytes (32 * 4), not 64 bytes.
Either:
- Change `kAlignment` to 16 if 64-byte alignment is desired, or
- Update the comment to say "align to 32 elements" or "align to 128 bytes (for float)"
```diff
- // Lets align size_per_thread_qdq_vec to 64 bytes, for coalesced cache reads,
+ // Lets align size_per_thread_qdq_vec to 32 elements (128 bytes for float), for coalesced cache reads,
```
```cpp
// by padding with right number of per thread elements
constexpr int64_t kAlignment = 32;
size_per_thread_qdq_vec =
    (size_per_thread_qdq_vec + kAlignment - 1) & (-(kAlignment - 1));
```
The alignment calculation is incorrect. The formula `(x + kAlignment - 1) & (-(kAlignment - 1))` uses the wrong mask.
For aligning to a power-of-2 boundary, the correct formula is:
`(size_per_thread_qdq_vec + kAlignment - 1) & (-kAlignment)`
or equivalently:
`(size_per_thread_qdq_vec + kAlignment - 1) & ~(kAlignment - 1)`
The current code uses `-(kAlignment - 1)`, which equals -31 = 0xFFFFFFE1, but the correct mask should be -32 = 0xFFFFFFE0 to properly zero out the bottom 5 bits.
```diff
- (size_per_thread_qdq_vec + kAlignment - 1) & (-(kAlignment - 1));
+ (size_per_thread_qdq_vec + kAlignment - 1) & -kAlignment;
```
@copilot are you sure? Please double check again
```cpp
// at::Tensor buf_reduced = at::empty(
// {num_thread, qSplitSize, is_reduced_type ? kvSplitSize : 0},
// query.options());
int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;
```
The buffer size calculation appears to be larger than necessary. The dequantize operation needs kvBlockSize * headSize elements (at most kvSplitSize * headSize), but this allocates qSplitSize * kvSplitSize * headSize. The extra qSplitSize factor seems unnecessary and wastes memory per thread.
Consider changing to:
```diff
- int64_t size_per_thread_qdq_vec = qSplitSize * kvSplitSize * headSize;
+ int64_t size_per_thread_qdq_vec = kvSplitSize * headSize;
```
@mergennachin I've opened a new pull request, #15852, to work on those changes. Once the pull request is ready, I'll request review from you.
37078bb into gh/kimishpatel/202/base
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #15610 by @kimishpatel ^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/kimishpatel/202/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/202/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/kimishpatel/202/orig
Differential Revision: [D85532077](https://our.internmc.facebook.com/intern/diff/D85532077/)
@diff-train-skip-merge
Co-authored-by: Kimish Patel <kimishpatel@fb.com>
Stack from ghstack (oldest at bottom):
For small models, dequantizing portions of the v cache causes extra allocation overhead.
Probably a better way to handle this is to dequantize the entire v cache outside the model.
There isn't a significant perf advantage from this yet, but subsequent diffs will use a caching allocator, where this refactor helps.
Differential Revision: D85532077