[Executorch] make slice_copy parallel #15830
meta-codesync[bot] merged 24 commits into gh/kimishpatel/214/base
When doing large prefills in LLMs, slice_copy takes about 5-10% of the time, mainly due to slicing in the RoPE implementation. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15830

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures as of commit 4c5d92d with merge base 9cd8402.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull Request Overview
This pull request adds parallel processing to the slice_copy operation in ExecuTorch to improve performance during large prefills in LLMs, where slice_copy can take 5-10% of execution time (primarily from slicing in the RoPE implementation).
Key Changes:
- Added multithreading support to the `compute_slice` function with workload-based thresholds
- Parallel execution distributes work across leading dimensions using `parallel_for`
- Single-threaded fallback maintained for smaller workloads
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| kernels/portable/cpu/util/targets.bzl | Adds threadpool dependency required for parallel execution support |
| kernels/portable/cpu/util/slice_util.cpp | Implements parallel slice_copy with multithreading when leading_dims ≥ 8 and total_elements ≥ 32768, maintaining single-threaded fallback for smaller workloads |
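The strategy summarized above (split the leading dimensions across threads once the workload is large enough, otherwise copy on one thread) can be sketched as follows. This is a minimal illustration, not the code from this PR: it uses plain `std::thread` in place of ExecuTorch's `parallel_for`, and the function name `slice_copy_sketch`, the constants `kMinLeadingDimsForMT` and `kMinElementsForMT`, the `float` element type, and the definition of the total element count as the output size are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical thresholds, mirroring the values quoted in the review summary.
constexpr int64_t kMinLeadingDimsForMT = 8;
constexpr int64_t kMinElementsForMT = 32768;

// Input is viewed as [leading_dims, dim_length, trailing_dims]; output as
// [leading_dims, length, trailing_dims]. Rows start, start + step, ... along
// the middle dimension are copied.
void slice_copy_sketch(const float* in, float* out, int64_t leading_dims,
                       int64_t dim_length, int64_t trailing_dims,
                       int64_t start, int64_t length, int64_t step,
                       int64_t num_threads) {
  assert(step > 0 && start >= 0);
  assert(length == 0 || start + (length - 1) * step < dim_length);

  // Copies the slice for leading indices in [begin, end).
  auto copy_range = [=](int64_t begin, int64_t end) {
    const size_t row_bytes = static_cast<size_t>(trailing_dims) * sizeof(float);
    for (int64_t i = begin; i < end; ++i) {
      for (int64_t j = 0; j < length; ++j) {
        const float* src = in + (i * dim_length + start + j * step) * trailing_dims;
        float* dst = out + (i * length + j) * trailing_dims;
        std::memcpy(dst, src, row_bytes);
      }
    }
  };

  // Assumption: "total elements" means the number of output elements.
  const int64_t total_elements = leading_dims * length * trailing_dims;
  const bool use_multithreading = num_threads > 1 &&
      leading_dims >= kMinLeadingDimsForMT &&
      total_elements >= kMinElementsForMT;

  if (!use_multithreading) {
    copy_range(0, leading_dims);  // single-threaded fallback
    return;
  }

  // Each worker gets a contiguous, disjoint block of leading indices, so the
  // output regions never overlap and no synchronization is needed.
  const int64_t chunk = (leading_dims + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (int64_t begin = 0; begin < leading_dims; begin += chunk) {
    const int64_t end = std::min(begin + chunk, leading_dims);
    workers.emplace_back(copy_range, begin, end);
  }
  for (auto& worker : workers) {
    worker.join();
  }
}
```

Splitting on the leading dimension keeps each worker's writes in one contiguous output range, which is why the two thresholds matter: with few leading indices or little data, thread start-up cost would likely outweigh the copy itself.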
    if (use_multithreading) {
      // Use parallel_for to distribute work across leading dimensions
      // Calculate grain size based on number of elements per leading dimension
      const int64_t elements_per_leading_dim = length * trailing_dims;
The variable elements_per_leading_dim is calculated but never used. It appears this was intended for grain size calculation but MIN_LEADING_DIMS_FOR_MT is used instead. Consider removing this unused variable or using it to calculate a more dynamic grain size based on workload characteristics.
Suggested change:

    const int64_t elements_per_leading_dim = length * trailing_dims;
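For the alternative the review comment raises, using `elements_per_leading_dim` to derive a more dynamic grain size, one possible approach is sketched below. It is illustrative only and not taken from the PR: the helper name `compute_grain_size` and the target constant `kTargetElementsPerTask` are hypothetical, and the target value is an arbitrary choice for the example.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical target: roughly how many elements one task should copy.
constexpr int64_t kTargetElementsPerTask = 32768;

// Returns how many leading indices to group into one task so that each task
// copies at least kTargetElementsPerTask elements (always at least 1).
int64_t compute_grain_size(int64_t length, int64_t trailing_dims) {
  const int64_t elements_per_leading_dim = length * trailing_dims;
  if (elements_per_leading_dim <= 0) {
    return 1;  // empty slice: any positive grain size works
  }
  // Ceiling division: the smallest number of leading indices whose combined
  // element count reaches the target.
  const int64_t grain =
      (kTargetElementsPerTask + elements_per_leading_dim - 1) /
      elements_per_leading_dim;
  return std::max<int64_t>(grain, 1);
}
```

With this scheme, slices with many elements per leading index get a grain size of 1 (one leading index per task), while thin slices are batched so each task still has enough work to amortize scheduling overhead.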
Merged commit 1a73b51 into gh/kimishpatel/214/base
Pull Request resolved: #15830. When doing large prefills in LLMs, slice_copy takes about 5-10% of the time, mainly due to slicing in the RoPE implementation. ghstack-source-id: 327688163. Differential Revision: [D85532081](https://our.internmc.facebook.com/intern/diff/D85532081/)