Calculating axis's local wg size based on global workload and making it as close as possible to warp size of 32.#6409

Merged

facebook-github-bot merged 1 commit intopytorch:mainfrom

trviv:export-D64418632

Oct 22, 2024

Contributor

trviv commented Oct 21, 2024

Summary:
This diff changes the local workgroup size calculation logic in the Vulkan backend of Executorch.

The workgroup size of the largest axis is kept largest so workgroups are better occupied.
The workgroup size is calculated based on the warp size of 32. When kernel is 2 dimensional largest axis is kept close to warp size it, so threads in the same warp Read / Write to consecutive memory locations, thus improving performance.

Reviewed By: SS-JIA

Differential Revision: D64418632

pytorch-bot bot commented Oct 21, 2024 •

edited

Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/6409

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit eaf3bf9 with merge base 8c96805 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label

Contributor

facebook-github-bot commented Oct 21, 2024

This pull request was exported from Phabricator. Differential Revision: D64418632

facebook-github-bot added the fb-exported label

SS-JIA approved these changes

View reviewed changes

trviv added a commit to trviv/executorch that referenced this pull request


          Calculating axis's local wg size based on global workload and making …

0c538c4

…it as close as possible to warp size of 32. (pytorch#6409)

Summary:

This diff changes the local workgroup size calculation logic in the Vulkan backend of Executorch.

The workgroup size of the largest axis is kept largest so workgroups are better occupied.
The workgroup size is calculated based on the warp size of 32. When kernel is 2 dimensional largest axis is kept close to warp size it, so threads in the same warp Read / Write to consecutive memory locations, thus improving performance.

Reviewed By: SS-JIA

Differential Revision: D64418632

trviv force-pushed the export-D64418632 branch from 59f7b82 to 0c538c4 Compare

October 22, 2024 13:11

Contributor

facebook-github-bot commented Oct 22, 2024

This pull request was exported from Phabricator. Differential Revision: D64418632

trviv added a commit to trviv/executorch that referenced this pull request


          Calculating axis's local wg size based on global workload and making …

b79149b

…it as close as possible to warp size of 32. (pytorch#6409)

Summary:

This diff changes the local workgroup size calculation logic in the Vulkan backend of Executorch.

The workgroup size of the largest axis is kept largest so workgroups are better occupied.
The workgroup size is calculated based on the warp size of 32. When kernel is 2 dimensional largest axis is kept close to warp size it, so threads in the same warp Read / Write to consecutive memory locations, thus improving performance.

Reviewed By: SS-JIA

Differential Revision: D64418632

trviv force-pushed the export-D64418632 branch from 0c538c4 to b79149b Compare

October 22, 2024 13:28

Contributor

facebook-github-bot commented Oct 22, 2024

This pull request was exported from Phabricator. Differential Revision: D64418632

trviv added a commit to trviv/executorch that referenced this pull request


          Calculating axis's local wg size based on global workload and making …

615d411

…it as close as possible to warp size of 32. (pytorch#6409)

Summary:

This diff changes the local workgroup size calculation logic in the Vulkan backend of Executorch.

The workgroup size of the largest axis is kept largest so workgroups are better occupied.
The workgroup size is calculated based on the warp size of 32. When kernel is 2 dimensional largest axis is kept close to warp size it, so threads in the same warp Read / Write to consecutive memory locations, thus improving performance.

Reviewed By: SS-JIA

Differential Revision: D64418632

trviv force-pushed the export-D64418632 branch from b79149b to 615d411 Compare

October 22, 2024 14:21

Contributor

facebook-github-bot commented Oct 22, 2024

This pull request was exported from Phabricator. Differential Revision: D64418632

trviv added a commit to trviv/executorch that referenced this pull request


          Calculating axis's local wg size based on global workload and making …

6d93b0b

…it as close as possible to warp size of 32. (pytorch#6409)

Summary:

This diff changes the local workgroup size calculation logic in the Vulkan backend of Executorch.

The workgroup size of the largest axis is kept largest so workgroups are better occupied.
The workgroup size is calculated based on the warp size of 32. When kernel is 2 dimensional largest axis is kept close to warp size it, so threads in the same warp Read / Write to consecutive memory locations, thus improving performance.

Reviewed By: SS-JIA

Differential Revision: D64418632

trviv force-pushed the export-D64418632 branch from 615d411 to 6d93b0b Compare

October 22, 2024 16:30

Contributor

facebook-github-bot commented Oct 22, 2024

This pull request was exported from Phabricator. Differential Revision: D64418632


          Calculating axis's local wg size based on global workload and making …

eaf3bf9

…it as close as possible to warp size of 32. (pytorch#6409)

Summary:

This diff changes the local workgroup size calculation logic in the Vulkan backend of Executorch.

The workgroup size of the largest axis is kept largest so workgroups are better occupied.
The workgroup size is calculated based on the warp size of 32. When kernel is 2 dimensional largest axis is kept close to warp size it, so threads in the same warp Read / Write to consecutive memory locations, thus improving performance.

Reviewed By: SS-JIA

Differential Revision: D64418632

trviv force-pushed the export-D64418632 branch from 6d93b0b to eaf3bf9 Compare

October 22, 2024 17:27

Contributor

facebook-github-bot commented Oct 22, 2024

This pull request was exported from Phabricator. Differential Revision: D64418632

facebook-github-bot merged commit fe20be9 into pytorch:main

SS-JIA added a commit that referenced this pull request


          [ET-VK] Better work group sizes for matmul

df79828

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

[ghstack-poisoned]

SS-JIA mentioned this pull request

[ET-VK] Better work group sizes for matmul #13185

Merged

SS-JIA added a commit that referenced this pull request


          [ET-VK] Better work group sizes for matmul

9e96749

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

ghstack-source-id: 301415132
Pull Request resolved: #13185

SS-JIA pushed a commit that referenced this pull request


          Update base for Update on "[ET-VK] Better work group sizes for matmul"

b2e2ae5

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

[ghstack-poisoned]

SS-JIA pushed a commit that referenced this pull request


          Update on "[ET-VK] Better work group sizes for matmul"

1555a43

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

[ghstack-poisoned]

SS-JIA pushed a commit that referenced this pull request


          Update base for Update on "[ET-VK] Better work group sizes for matmul"

9bf6348

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

[ghstack-poisoned]

SS-JIA pushed a commit that referenced this pull request


          Update on "[ET-VK] Better work group sizes for matmul"

563de32

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

[ghstack-poisoned]

SS-JIA pushed a commit that referenced this pull request


          Update base for Update on "[ET-VK] Better work group sizes for matmul"

93303c5

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

[ghstack-poisoned]

SS-JIA pushed a commit that referenced this pull request


          Update on "[ET-VK] Better work group sizes for matmul"

665483a

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

[ghstack-poisoned]

SS-JIA pushed a commit that referenced this pull request


          [ET-VK] Better work group sizes for matmul (#13378)

f95a3f7

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / #6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

agrima1304 pushed a commit to agrima1304/executorch that referenced this pull request


          [ET-VK] Better work group sizes for matmul (pytorch#13378)

da601a8

## Context

Currently `default_pick_local_wg_size()` (which internally calls `ComputeGraph::create_local_wg_size`) is used to select the local work group size for matrix multiplication ops. However, these functions currently bias the size of the local work group towards the largest dim of the global work group producing local wg sizes like

```
shader                                                                          globalwg size            localwg size
===========                                                                     =====================    ====================                 =============
linear_qga4w_tiled_texture3d_texture3d_texture2d_float                          {256, 29, 1}             {32, 2, 1}                                    1487
matmul_naive_texture3d_float                                                    {29, 115, 32}            {4, 2, 8}                                      712
```

for matrix multiplication shaders. This behaviour was introduced in D64418632 / pytorch#6409.

However, through experimental testing a "square" work group size of `{8, 8, 1}` works a lot better for matrix multiplication shaders. The theoretical analysis for this behaviour is that the local work group size determines the memory locations that need to be loaded to compute the overall work group. For a work group with size `{W, H, 1}` the data required to compute the output would be `W * OUTPUT_TILE_W` columns of the weight tensor and `H * OUTPUT_TILE_H` rows of the input tensor. Note that all work group items in the same W index will be requesting the same columns from the weight tensor, and all work group items in the same H index will be requesting the same rows from the input tensor.

If `H==W`, then that "balances" the amount of data needed to loaded from each input tensor and may result in better data sharing behaviour among all work group items. Assuming `OUTPUT_TILE_W == OUTPUT_TILE_H == 1`, a local work group of size `{64, 1, 1}` would require 1 unique row from the input tensor an 64 unique columns to be loaded from the weight tensor, resulting in `(1 + 64) * K = 65K` elements to be loaded in total, where K is the size of the shared reduction dim. Conversely, a local work group of size `{8, 8, 1}` would require 8 unique rows / 8 unique columns resulting in only `(8 + 8) * K = 16K` unique elements to be loaded.

This highlights the need to use dedicated logic to compute work group sizes for matrix multiplication shaders.

## Changes

* Introduce `pick_hw_square_wg_size`
* Use the new local work group size determination function for Quantized Linear, Matmul, and Linear

Differential Revision: [D79813236](https://our.internmc.facebook.com/intern/diff/D79813236/)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed fb-exported