[Arm64] Treat Math/MathF.FusedMultiplyAdd as intrinsics #40124
echesakov merged 3 commits into dotnet:master
Conversation
@dotnet/jit-contrib @tannergooding I think this should be ready for review now. There are no codegen changes across the framework libraries since there are no uses of …
ed89d56 to 833aaba
{
    simdRegToSimdRegMove = (intrin.op1->TypeGet() == TYP_DOUBLE);
}
else if ((intrin.id == NI_Vector64_ToScalar) || (intrin.id == NI_Vector128_ToScalar))
intrin.id == NI_Vector64_ToScalar
Should we also special-case GetElement(0)?
Yes, we could do this for GetElement(0) and some others (e.g. AdvSimd.Extract with index 0).
I decided not to do the optimization part at the moment and to keep such changes for .NET 6.0.
The reason I changed LSRA in this PR is that treating MathF.FusedMultiplyAdd as an intrinsic introduced an annoying and unnecessary fmov (as you can see in the codegen diffs) and I wanted to avoid that.
Would it make sense to open an issue for .NET 6 to ensure that we are doing this for all intrinsics that have move/copy semantics?
There are likely a lot of places where we could introduce simple optimizations for things like extractions or shuffles (on both x86 and ARM).
LLVM/MSVC do many of these already, and it might be good to construct a general list of them for potential .NET 6 work.
// If we have an RMW intrinsic or an intrinsic with simple move semantic between two SIMD registers,
// we want to preference op1Reg to the target if op1 is not contained.
if (isRMW || simdRegToSimdRegMove)
simdRegToSimdRegMove is to generate better code when the value is lastUse, right?
Do we want a similar optimization for x86 and should we log an issue to track it?
simdRegToSimdRegMove is to generate better code when the value is lastUse, right?
Correct; if op1Reg is not at its last use, then LSRA should pick a different register for targetReg.
Do we want a similar optimization for x86 and should we log an issue to track it?
Yes, and I agree with @CarolEidt that we need a separate issue to track all of them for .NET 6.
Opened #40489
* Transform Math{F}.FusedMultiplyAdd(x,y,z) into
AdvSimd.FusedMultiplyAddScalar(
Vector64.CreateScalarUnsafe(z),
Vector64.CreateScalarUnsafe(y),
Vector64.CreateScalarUnsafe(x)).ToScalar() on Arm64 in importer.cpp
* Add containment analysis for AdvSimd_FusedMultiplyAddScalar in lower.h lowerarmarch.cpp
* Set tgtPrefOp1 for intrinsics with SIMD-to-SIMD move semantics in lsraarm64.cpp
This PR consists of three parts:

Treating Math.FusedMultiplyAdd and MathF.FusedMultiplyAdd as JIT intrinsics by transforming a call to MathF.FusedMultiplyAdd into

AdvSimd.FusedMultiplyAddScalar(
    Vector64.CreateScalarUnsafe(z),
    Vector64.CreateScalarUnsafe(y),
    Vector64.CreateScalarUnsafe(x)).ToScalar()

and transforming a call to Math.FusedMultiplyAdd into the same AdvSimd.FusedMultiplyAddScalar sequence with Vector64<double> operands.

The second change is not required to resolve #40078 but would be nice to have and is basically an arm64 implementation of the work that @EgorBo did in dotnet/coreclr#27060.
I attached JitDisasm for methods in MathFusedMultiplyAdd_ro before and after the changes.
MathFusedMultiplyAdd_ro-After.txt
MathFusedMultiplyAdd_ro-Before.txt
As an example, the following are two assembly listings for Check5(double, double, double), which corresponds to Math.FusedMultiplyAdd(op1, -op2, op3).

Before:
After:
The third change preferences op1Reg for targetReg of intrinsics that have "simd-to-simd move" semantics. For example, both Vector64.CreateScalarUnsafe(float) and AdvSimd.Arm64.DuplicateToVector64(double) (this is what Vector64.Create(double) gets lowered to) have these semantics. With that change the code for Check5 has one less instruction.

Thanks @CarolEidt for proposing the solution to the issue with the redundant fmov instruction.

The following are JitDisasm for methods in MathFusedMultiplyAdd_ro after all the changes (including the ones to LSRA).
MathFusedMultiplyAdd_ro-After2.txt