[wasm] Implement I2 and I4 shuffles in the jiterpreter #86469

Merged
kg merged 2 commits into dotnet:main from kg:wasm-jiterp-shuffles
May 19, 2023

kg (Member) commented May 18, 2023

Enabling v128 and packedsimd support causes code that relies on I2/I4 shuffles to run. The scalar fallback implementation of these shuffles is extremely slow, which makes algorithms that depend on them much slower, at least according to browser-bench.

This PR adds implementations for those shuffles by taking the short- or int-sized shuffle vectors, narrowing them to bytes, replicating the narrowed vector across all the lanes, and adding bits to produce an equivalent byte shuffle vector. It then uses the wasm byte swizzle intrinsic to emulate the desired shuffle operation.
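To illustrate the idea, here is a minimal scalar model in Python. It is a sketch, not the code from this PR: the helper names (`swizzle_bytes`, `expand_lane_indices`, `shuffle_lanes`) are made up for illustration, and mapping out-of-range lane indices to `0xFF` so that they produce zeros is an assumption about the desired shuffle semantics. The only part taken from the wasm spec is the byte swizzle rule: an index below 16 selects that byte, any other index yields 0.

```python
def swizzle_bytes(data: bytes, indices: bytes) -> bytes:
    """Scalar model of a 16-byte swizzle: index >= 16 selects zero."""
    assert len(data) == 16 and len(indices) == 16
    return bytes(data[i] if i < 16 else 0 for i in indices)


def expand_lane_indices(lane_indices, lane_size: int) -> bytes:
    """Turn per-lane indices (2- or 4-byte lanes) into 16 byte indices.

    Lane index k becomes bytes k*lane_size, k*lane_size+1, ...
    Out-of-range lanes map to 0xFF (assumption) so the swizzle zeroes them.
    """
    lane_count = 16 // lane_size
    assert len(lane_indices) == lane_count
    out = bytearray()
    for k in lane_indices:
        for b in range(lane_size):
            if 0 <= k < lane_count:
                out.append(k * lane_size + b)
            else:
                out.append(0xFF)
    return bytes(out)


def shuffle_lanes(data: bytes, lane_indices, lane_size: int) -> bytes:
    """Emulate a lane shuffle using only the byte swizzle."""
    return swizzle_bytes(data, expand_lane_indices(lane_indices, lane_size))


# Reversing eight 2-byte lanes of the bytes 0..15.
reversed_shorts = shuffle_lanes(bytes(range(16)), [7, 6, 5, 4, 3, 2, 1, 0], 2)
```

In this model the expansion runs on every call, which mirrors the per-operation conversion cost discussed in this comment.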

In my testing this speeds up 'Span, Reverse chars' from browser-bench considerably (~0.12ms -> 0.04ms) but does not seem to speed up IndexOf or SequenceEqual for chars. I'm not sure why, but one guess is that the cost of converting the shuffle vectors on every operation is too significant. A future improvement would be to detect constant shuffle vectors and perform the full expansion at JIT time, which might close the gap. You can see an example of how a constant shuffle vector still generates the full emulation logic here:
[image]
Since we know the vector at offset 160 is constant, we can safely remove all of that work at some point. I'm not sure whether it makes sense to try to do this in the interp at transform time; it's probably better to do it in jiterp.
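The constant-vector optimization mentioned above could look roughly like the following Python sketch. It is hypothetical: `make_constant_shuffle` and `expand_lane_indices` are illustrative names, not jiterpreter APIs, and the out-of-range handling (index `0xFF` yields zero) is an assumption. The point is only that the lane-to-byte expansion happens once, when the shuffle is set up, rather than on every invocation.

```python
def expand_lane_indices(lane_indices, lane_size: int) -> bytes:
    """Expand per-lane indices into 16 byte indices (0xFF for out-of-range)."""
    lane_count = 16 // lane_size
    out = bytearray()
    for k in lane_indices:
        for b in range(lane_size):
            out.append(k * lane_size + b if 0 <= k < lane_count else 0xFF)
    return bytes(out)


def make_constant_shuffle(lane_indices, lane_size: int):
    """Precompute the byte indices once and return a cheap shuffle function."""
    byte_indices = expand_lane_indices(lane_indices, lane_size)  # done once

    def shuffle(data: bytes) -> bytes:
        # Per-call work is just the byte swizzle itself.
        return bytes(data[i] if i < 16 else 0 for i in byte_indices)

    return shuffle


# A constant "reverse eight chars" shuffle, built once and reused.
reverse_chars = make_constant_shuffle([7, 6, 5, 4, 3, 2, 1, 0], 2)
chunk = "abcdefgh".encode("utf-16-le")  # eight 2-byte chars = 16 bytes
result = reverse_chars(chunk).decode("utf-16-le")
```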

kg added the arch-wasm (WebAssembly architecture) and area-Codegen-Jiterpreter-mono labels May 18, 2023
kg requested review from kotlarmilos, radekdoulik and vargaz May 18, 2023 22:32
kg requested review from lewing and pavelsavara as code owners May 18, 2023 22:32
ghost assigned kg May 18, 2023

ghost commented May 18, 2023

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

kg (Member, Author) commented May 18, 2023

Incidentally, the code AOT generates for this (and the code clang generates by default using its most similar three-operand shuffle intrinsic) does a bunch of extract and replace lane operations instead. I don't really understand why that's the chosen approach to emulate shuffles; it seems like it would be much more expensive, and the generated code is enormous. Do you have any idea why, @radekdoulik?

kg merged commit 40ba49a into dotnet:main May 19, 2023
ghost locked as resolved and limited conversation to collaborators Jun 18, 2023
2 participants