Add a fast path for _clone_dim_order #15815
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15815
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure
As of commit 421c6dc with merge base e774b77: NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D86993338.
Note that the moshi failure is pre-existing.
Summary
Add a direct memcpy fast path for the portable _clone_dim_order op, as it can be a performance bottleneck. I'd like to optimize these ops out of the graph more aggressively, but in the meantime this fast path should significantly reduce their performance impact.
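For readers unfamiliar with the idea, here is a minimal standalone sketch of what a memcpy fast path for a dim-order clone can look like. It is not the code from this PR: `TensorView`, `strides_from_dim_order`, and `clone_dim_order_sketch` are hypothetical names, the element type is fixed to `float`, and both tensors are assumed to be densely packed with identical sizes. When the input and output dim orders match, the two buffers have the same layout and one `memcpy` suffices. Otherwise the sketch falls back to an element-by-element strided copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical minimal tensor view; NOT the ExecuTorch Tensor type.
struct TensorView {
  float* data;
  std::vector<int64_t> sizes;      // logical shape, e.g. {N, C, H, W}
  std::vector<uint8_t> dim_order;  // dims listed outermost to innermost in memory
};

// Total number of elements in the tensor.
inline size_t numel(const TensorView& t) {
  size_t n = 1;
  for (int64_t s : t.sizes) n *= static_cast<size_t>(s);
  return n;
}

// Strides (in elements) implied by the dim order of a densely packed tensor.
inline std::vector<size_t> strides_from_dim_order(const TensorView& t) {
  std::vector<size_t> strides(t.sizes.size());
  size_t running = 1;
  for (size_t i = t.dim_order.size(); i-- > 0;) {
    const uint8_t d = t.dim_order[i];
    strides[d] = running;
    running *= static_cast<size_t>(t.sizes[d]);
  }
  return strides;
}

// Copies `in` into `out`, honoring each tensor's dim order.
// Returns true if the memcpy fast path was taken.
inline bool clone_dim_order_sketch(const TensorView& in, TensorView& out) {
  assert(in.sizes == out.sizes);  // assumption: shapes already validated

  // Fast path: identical dim order means identical memory layout.
  if (in.dim_order == out.dim_order) {
    std::memcpy(out.data, in.data, numel(in) * sizeof(float));
    return true;
  }

  // Slow path: walk every logical index and translate it through both
  // sets of strides.
  const std::vector<size_t> in_strides = strides_from_dim_order(in);
  const std::vector<size_t> out_strides = strides_from_dim_order(out);
  const size_t rank = in.sizes.size();
  const size_t n = numel(in);
  std::vector<size_t> idx(rank, 0);
  for (size_t i = 0; i < n; ++i) {
    size_t in_off = 0;
    size_t out_off = 0;
    for (size_t d = 0; d < rank; ++d) {
      in_off += idx[d] * in_strides[d];
      out_off += idx[d] * out_strides[d];
    }
    out.data[out_off] = in.data[in_off];
    // Advance the multi-dimensional index, last logical dim fastest.
    for (size_t d = rank; d-- > 0;) {
      if (++idx[d] < static_cast<size_t>(in.sizes[d])) break;
      idx[d] = 0;
    }
  }
  return false;
}
```

The slow path does index arithmetic for every element, while the fast path is a single bulk copy, which is consistent with the large gap reported for debug builds in the test plan.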
Test plan
Existing correctness tests for the _clone_dim_order implementation should cover it.
For performance, I ran a quick test on an x86 server using a tensor of shape (1, 128, 256, 256) in the default dim order. This is intended as a quick smoke test, not a proper benchmark. I included numbers for both optimized and debug builds. The optimized numbers matter more, but very long debug runs can be painful during development.
[Optimized Build]
Before: 27.9 ms
After: 6.4 ms
[Debug Build]
Before: 5947.01 ms
After: 7.2 ms
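
A smoke test like the one above can be approximated with a small wall-clock timing helper. The sketch below is not the harness used for these numbers: `time_ms` and `time_memcpy_clone_ms` are hypothetical names, the element type is assumed to be `float`, and it times only a flat `memcpy` between two buffers rather than the actual op.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: mean wall-clock milliseconds per call of `fn`.
template <typename Fn>
double time_ms(Fn&& fn, int iterations) {
  using clock = std::chrono::steady_clock;
  fn();  // one warm-up call so first-touch page faults are not measured
  const auto start = clock::now();
  for (int i = 0; i < iterations; ++i) fn();
  const auto end = clock::now();
  const std::chrono::duration<double, std::milli> elapsed = end - start;
  return elapsed.count() / iterations;
}

// Times a flat memcpy from `src` into `dst` (assumed to be the same size).
inline double time_memcpy_clone_ms(const std::vector<float>& src,
                                   std::vector<float>& dst,
                                   int iterations) {
  assert(src.size() == dst.size());
  return time_ms(
      [&] { std::memcpy(dst.data(), src.data(), src.size() * sizeof(float)); },
      iterations);
}
```

For the shape in the test plan, each buffer would hold 1 * 128 * 256 * 256 = 8,388,608 elements, which is 32 MiB if the dtype is float32 (the dtype is an assumption here; the description does not state it).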