Calculating axis's local wg size based on global workload and making it as close as possible to warp size of 32. #6409
Merged
facebook-github-bot merged 1 commit into pytorch:main on Oct 22, 2024
Contributor
This pull request was exported from Phabricator. Differential Revision: D64418632
SS-JIA
approved these changes
Oct 21, 2024
trviv
added a commit
to trviv/executorch
that referenced
this pull request
Oct 22, 2024
…it as close as possible to warp size of 32. (pytorch#6409)
Summary: This diff changes the local workgroup size calculation logic in the Vulkan backend of ExecuTorch. The largest axis of the global workload is given the largest local workgroup size so that workgroups are better occupied. The workgroup size is calculated based on a warp size of 32. When the kernel is two-dimensional, the size along the largest axis is kept close to the warp size, so that threads in the same warp read and write consecutive memory locations, which improves performance.
Reviewed By: SS-JIA
Differential Revision: D64418632
59f7b82 to
0c538c4
Compare
0c538c4 to
b79149b
Compare
b79149b to
615d411
Compare
615d411 to
6d93b0b
Compare
6d93b0b to
eaf3bf9
Compare
SS-JIA
added a commit
that referenced
this pull request
Aug 7, 2025
## Context
Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions bias the local work group size towards the largest dim of the global work group, producing local wg sizes like
```
shader                                                  globalwg size  localwg size
======================================================  =============  ============  ====
linear_qga4w_tiled_texture3d_texture3d_texture2d_float  {256, 29, 1}   {32, 2, 1}    1487
matmul_naive_texture3d_float                            {29, 115, 32}  {4, 2, 8}      712
```
for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.
However, experimental testing shows that a "square" work group size of `{8, 8, 1}` performs considerably better for matrix multiplication shaders. The theoretical explanation is that the local work group size determines which memory locations must be loaded to compute the output of the whole work group. For a work group of size `{W, H, 1}`, the data required to compute the output is `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items with the same W index request the same columns from the weight tensor, and all work group items with the same H index request the same rows from the input tensor.
If `H == W`, the amount of data that needs to be loaded from each input tensor is "balanced", which may result in better data sharing among the work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` requires 1 unique row from the input tensor and 64 unique columns from the weight tensor, so `(1 + 64) * K = 65K` elements are loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` requires 8 unique rows and 8 unique columns, so only `(8 + 8) * K = 16K` unique elements are loaded.
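The load-count argument above can be checked with a short calculation. The sketch below is illustrative only: the function name `unique_elements_loaded` and its parameters are assumptions made for this example and are not part of the ExecuTorch code base. It counts unique elements under the simplified model described in the text and ignores caching effects.

```python
def unique_elements_loaded(local_wg, k, tile_w=1, tile_h=1):
    """Unique input elements one work group must load (simplified model).

    local_wg: (W, H, D) local work group size; only W and H matter here.
    k:        size of the shared reduction dimension.
    tile_w/tile_h: output tile handled by each work group item.
    """
    w, h, _ = local_wg
    weight_columns = w * tile_w  # unique columns of the weight tensor
    input_rows = h * tile_h      # unique rows of the input tensor
    return (weight_columns + input_rows) * k


K = 128  # example reduction size, chosen arbitrarily
flat = unique_elements_loaded((64, 1, 1), K)    # (64 + 1) * K
square = unique_elements_loaded((8, 8, 1), K)   # (8 + 8) * K
print(flat, square)
```

Both shapes contain 64 work group items, so the comparison isolates the effect of the work group shape: under this model the square shape needs 16K unique elements versus 65K for the flat one.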
This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.
## Changes
* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear
Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)
SS-JIA
added a commit
that referenced
this pull request
Aug 7, 2025
ghstack-source-id: 301415132
Pull Request resolved: #13185
agrima1304
pushed a commit
to agrima1304/executorch
that referenced
this pull request
Aug 26, 2025
Summary:
This diff changes the local workgroup size calculation logic in the Vulkan backend of ExecuTorch.
The largest axis of the global workload is given the largest local workgroup size so that workgroups are better occupied.
The workgroup size is calculated based on a warp size of 32. When the kernel is two-dimensional, the size along the largest axis is kept close to the warp size, so that threads in the same warp read and write consecutive memory locations, which improves performance.
Reviewed By: SS-JIA
Differential Revision: D64418632
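As a rough illustration of the idea in the summary, the sketch below assigns the largest local size to the largest global axis. It is a hypothetical heuristic written for this example, not the code from this diff: the function name `sketch_local_wg_size`, the budget of 64 invocations per workgroup, and the specific splits (`64x1x1`, `32x2x1`, `8x4x2`) are all assumptions.

```python
def sketch_local_wg_size(global_wg):
    """Hypothetical heuristic: give the largest global axis the largest local size.

    global_wg: (x, y, z) global workload extents.
    Returns an (x, y, z) local workgroup size with 64 invocations in total.
    """
    # Axis indices ordered from largest to smallest global extent.
    order = sorted(range(3), key=lambda axis: global_wg[axis], reverse=True)
    # Axes with extent 1 contribute no parallelism.
    active_axes = sum(1 for extent in global_wg if extent > 1)

    if active_axes <= 1:
        sizes = (64, 1, 1)  # 1D workload: everything on one axis
    elif active_axes == 2:
        sizes = (32, 2, 1)  # 2D workload: largest axis matches a warp of 32
    else:
        sizes = (8, 4, 2)   # 3D workload: split across all three axes

    local = [1, 1, 1]
    for rank, axis in enumerate(order):
        local[axis] = sizes[rank]
    return tuple(local)


print(sketch_local_wg_size((256, 29, 1)))
print(sketch_local_wg_size((10, 200, 50)))
```

For the two-dimensional global size `{256, 29, 1}` quoted earlier in this thread, this sketch yields `{32, 2, 1}`, so 32 consecutive invocations along the widest axis fall into one warp. The three-dimensional branch is purely illustrative and is not meant to reproduce any size reported in the thread.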