Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14187
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure
As of commit 235536e with merge base 44972ad, the following job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
SS-JIA
left a comment
Overall LGTM. Just some questions about things tried and some small requests for stylistic changes. Thanks for working on this!
outputTexel[3] += dot(inputVec, vec4(weight1OutputChannelPacked[3], weight2OutputChannelPacked[3], weight3OutputChannelPacked[3], weight4OutputChannelPacked[3]));
}
imageStore(t_out, ivec3(xIdx, yIdx, gid1), op(outputTexel, out_min, out_max));
iiuc, previously the shader calculated 4 output texels but the new one only calculates one. Have you experimented with computing a bigger output tile? Might be something that can get us a further boost. Note that I am also ok with landing the shader in its current form though, given the perf improvement.
I tried a bigger output tile, and on my end it's slightly slower for MobileNet's case. I believe this is due to the extra vector registers needed when we increase the tile size.
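To illustrate the tradeoff being discussed, here is a rough cost model in Python (my own sketch, not the PR's code; all names and counts are assumptions): widening the output tile amortizes weight-texel fetches across more outputs, but every extra output texel adds a vec4 accumulator and an input texel that stay live across the channel loop, i.e. higher register pressure.

```python
# Rough cost model (illustrative sketch, not the shader) for widening the
# output tile of the pointwise conv shader. Weight texels are shared across
# the whole tile, so fetches per output drop, but each extra output texel
# adds a vec4 accumulator (plus an input texel) held live across the loop.
def tile_costs(tile_width, in_channel_texels):
    weight_fetches = 4 * in_channel_texels          # 4 weight texels per input texel, shared by the tile
    input_fetches = tile_width * in_channel_texels  # one input texel per tile column
    live_vec4_accumulators = tile_width             # held for the whole channel loop
    return weight_fetches, input_fetches, live_vec4_accumulators
```

With 8 input-channel texels, a 1-wide tile issues 32 weight fetches for one output texel, while a 2-wide tile issues the same 32 for two outputs but doubles the live accumulators, which is consistent with the slowdown once register pressure dominates.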
| weight4OutputChannelPacked = texelFetch(t_kernel, ivec2(inputC * 4 + 3, gid1), 0); | ||
|
|
||
| const vec4 bias = texelFetch(t_bias, ivec2(out_pos_z, 0), 0); | ||
| outputTexel[0] += dot(inputVec, vec4(weight1OutputChannelPacked[0], weight2OutputChannelPacked[0], weight3OutputChannelPacked[0], weight4OutputChannelPacked[0])); |
In my experience, computing matmul-like operations with dot was not as fast as fma. Did you try computing with fma as well? Curious to know if you have had a different experience. Btw, you can take a look at the infographic in the old shader's comments to see how fma can be used instead of dot.
Just tried it on my end. For me, using fma instead of dot is essentially the same. Might be some compiler magic happening that lowers both to the same operations.
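For reference, the two formulations do compute the same result. A small Python model (scalar lists standing in for vec4s; names are my own, not the shader's) shows that the dot-per-output-channel form and the broadcast-plus-fma form are arithmetically equivalent, which is consistent with a compiler emitting the same code for both.

```python
# Python model (illustrative; scalar lists stand in for vec4s) showing that
# the dot and fma formulations of the packed matvec are equivalent.
# w[k][j] = weight for input channel k, output channel j (one "texel" per k).
def matvec_dot(inp, w):
    # One dot per output channel: gather lane j from each weight texel.
    return [sum(inp[k] * w[k][j] for k in range(4)) for j in range(4)]

def matvec_fma(inp, w):
    # Broadcast input lane k across weight texel k, fma into the accumulator.
    out = [0.0, 0.0, 0.0, 0.0]
    for k in range(4):
        out = [out[j] + inp[k] * w[k][j] for j in range(4)]
    return out
```

Both variants perform the same multiplies and additions in the same accumulation order, so even the floating-point results match exactly.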
@pytorchbot cherry-pick --onto release/1.0 -c fixnewfeature
This change improves the execution of the pointwise conv2d s1p0 shader. It does so through more of a GEMM-like implementation and more explicit loop unrolling. cc @SS-JIA @manuelcandales @cbilgin (cherry picked from commit b265324)
Cherry picking #14187
The cherry pick PR is at #14724 and it is recommended to link a fixnewfeature cherry pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team: Raised by workflow job
This change improves the execution of the pointwise conv2d s1p0 shader. It does so through more of a GEMM-like implementation and more explicit loop unrolling. cc @SS-JIA @manuelcandales @cbilgin
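The "GEMM-like" framing in the PR description can be illustrated with a minimal Python model (layout and names here are my own assumptions, not the shader's actual code): a pointwise (1×1, stride-1, pad-0) conv2d is, at every pixel, just a matrix-vector product over the channel dimension plus a bias.

```python
# Minimal Python model of a pointwise (1x1, stride-1, pad-0) conv2d.
# Illustrative only -- the layout is an assumption, not the shader's code.
# At each (y, x) the op reduces to out = W @ in + bias over channels, which
# is why a GEMM-style formulation with unrolled channel loops applies.
def pointwise_conv2d(inp, weight, bias):
    """inp: [C_in][H][W], weight: [C_out][C_in], bias: [C_out]."""
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out = len(weight)
    out = [[[0.0] * w for _ in range(h)] for _ in range(c_out)]
    for co in range(c_out):
        for y in range(h):
            for x in range(w):
                acc = bias[co]  # mirrors the t_bias texel fetch
                for ci in range(c_in):
                    acc += weight[co][ci] * inp[ci][y][x]
                out[co][y][x] = acc
    return out
```

The shader's texel layout packs 4 channels per vec4, so its inner loop is effectively this channel reduction unrolled 4-wide.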