[wasm] Implement I2 and I4 shuffles in the jiterpreter #86469
…CL char operations won't be terribly slow
Tagging subscribers to 'arch-wasm': @lewing
Incidentally, the code AOT generates for this (and the code clang generates by default using its most similar three-operand shuffle intrinsic) does a bunch of extract and replace lane operations instead. I don't really understand why that's the chosen approach to emulate shuffles; it seems like it would be much more expensive, and the generated code is enormous. Do you have any idea why, @radekdoulik?

Enabling v128 and packedsimd support causes code that relies on i2/i4 shuffles to run. The scalar fallback implementation of these shuffles is extremely slow, which makes algorithms that depend on them much slower, at least according to browser-bench.
This PR adds implementations for those shuffles by taking the short- and int-sized shuffle vectors, narrowing them to bytes, replicating the narrowed vector across all the lanes, and adding bits to produce an equivalent byte shuffle vector. It then uses the wasm byte swizzle intrinsic to emulate the desired shuffle operation.
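To illustrate the index conversion described above, here is a minimal scalar sketch in TypeScript. It is not the code from this PR: the names `swizzleBytes`, `expandLaneIndices` and `shuffleLanes` are made up for illustration, the real implementation would emit wasm SIMD operations rather than loop over bytes, and mapping out-of-range lane indices to 0xFF (so that the swizzle produces zero) is an assumption about the desired behavior.

```typescript
// Scalar model of a byte swizzle: each output byte is selected by the
// matching index byte, and any index >= 16 produces 0.
function swizzleBytes(data: Uint8Array, indices: Uint8Array): Uint8Array {
    const result = new Uint8Array(16);
    for (let i = 0; i < 16; i++)
        result[i] = indices[i] < 16 ? data[indices[i]] : 0;
    return result;
}

// Turn per-lane indices (8 lanes of 2 bytes for I2, 4 lanes of 4 bytes for I4)
// into the 16 byte indices that select the same data.
function expandLaneIndices(laneIndices: number[], laneSize: 2 | 4): Uint8Array {
    const laneCount = 16 / laneSize;
    if (laneIndices.length !== laneCount)
        throw new Error(`expected ${laneCount} lane indices`);
    const byteIndices = new Uint8Array(16);
    for (let i = 0; i < 16; i++) {
        // Every byte of an output lane uses that lane's index ("replicate")...
        const lane = laneIndices[Math.floor(i / laneSize)];
        // ...scaled to a byte offset, plus this byte's position inside the lane.
        // Assumption: out-of-range lane indices should select zero.
        byteIndices[i] = lane >= 0 && lane < laneCount
            ? lane * laneSize + (i % laneSize)
            : 0xff;
    }
    return byteIndices;
}

// Emulate an I2 or I4 shuffle with a single byte swizzle.
function shuffleLanes(data: Uint8Array, laneIndices: number[], laneSize: 2 | 4): Uint8Array {
    return swizzleBytes(data, expandLaneIndices(laneIndices, laneSize));
}
```

For example, reversing eight chars corresponds to `shuffleLanes(data, [7, 6, 5, 4, 3, 2, 1, 0], 2)`, which expands to the byte indices 14, 15, 12, 13, and so on down to 0, 1.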
In my testing this speeds up 'Span, Reverse chars' from browser-bench considerably (~0.12ms -> 0.04ms), but it does not seem to speed up IndexOf or SequenceEqual for chars. I'm not sure why, but one guess is that the cost of converting the shuffle vectors on every operation is too significant. A future improvement would be to detect constant shuffle vectors and perform the full expansion at JIT time, which might close the gap. You can see an example of how a constant shuffle vector still generates the full emulation logic here:

Since we know the vector at offset 160 is constant, we can safely remove all of that work at some point. I'm not sure whether it makes sense to try to do this in the interp at transform time; it's probably better to do it in jiterp.
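A rough sketch of what the constant-vector optimization might look like, in TypeScript. Everything here is hypothetical: the `IndexOperand` and `ShufflePlan` types and the `planShuffle` function do not exist in the jiterpreter, and how a constant vector would actually be detected is left open. The idea is only that when the lane indices are known at JIT time, the byte indices can be computed once and embedded as a constant, so the generated code would need just the swizzle.

```typescript
// Hypothetical description of a shuffle's index operand as seen at JIT time.
type IndexOperand =
    | { kind: "constant"; lanes: number[] }      // lane indices known while compiling
    | { kind: "dynamic"; localOffset: number };  // must be loaded and expanded at run time

// Hypothetical result: either precomputed byte indices to embed as a constant,
// or a marker meaning the full emulation sequence still has to be emitted.
type ShufflePlan =
    | { kind: "swizzle-with-constant"; byteIndices: Uint8Array }
    | { kind: "expand-at-runtime"; localOffset: number };

function planShuffle(operand: IndexOperand, laneSize: 2 | 4): ShufflePlan {
    if (operand.kind === "dynamic")
        return { kind: "expand-at-runtime", localOffset: operand.localOffset };

    // Constant case: do the lane-to-byte index expansion once, up front.
    const laneCount = 16 / laneSize;
    const byteIndices = new Uint8Array(16);
    for (let i = 0; i < 16; i++) {
        const lane = operand.lanes[Math.floor(i / laneSize)];
        // Assumption: out-of-range lane indices should select zero (index >= 16).
        byteIndices[i] = lane >= 0 && lane < laneCount
            ? lane * laneSize + (i % laneSize)
            : 0xff;
    }
    return { kind: "swizzle-with-constant", byteIndices };
}
```

Whether this belongs in the interp transform pass or in jiterp is exactly the open question above; the sketch only shows the compile-time half of the work.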