[Arm64] Treat Math/MathF.FusedMultiplyAdd as intrinsics #40124
echesakov merged 3 commits into dotnet:master
Conversation
@dotnet/jit-contrib @tannergooding I think this should be ready for review now. There are no codegen changes across the framework libraries since there are no uses of …
ed89d56 to 833aaba
{
    simdRegToSimdRegMove = (intrin.op1->TypeGet() == TYP_DOUBLE);
}
else if ((intrin.id == NI_Vector64_ToScalar) || (intrin.id == NI_Vector128_ToScalar))
intrin.id == NI_Vector64_ToScalar
Should we also special-case GetElement(0)?
Yes, we could do this for GetElement(0) and some others (e.g. AdvSimd.Extract with index 0).
I decided not to do the optimization part at the moment and to keep such changes for .NET 6.0.
The reason I changed LSRA in this PR is that treating MathF.FusedMultiplyAdd as an intrinsic introduced an annoying and unnecessary fmov (as you can see in the codegen diffs) and I wanted to avoid that.
Would it make sense to open an issue for .NET 6 to ensure that we are doing this for all intrinsics that have move/copy semantics?
There are likely a lot of places where we could introduce simple optimizations for things like extractions or shuffles (on both x86 and ARM).
LLVM/MSVC do many of these already, and it might be good to construct a general list of them for potential .NET 6 work.
// If we have an RMW intrinsic or an intrinsic with simple move semantic between two SIMD registers,
// we want to preference op1Reg to the target if op1 is not contained.
if (isRMW || simdRegToSimdRegMove)
simdRegToSimdRegMove is to generate better code when the value is lastUse, right?
Do we want a similar optimization for x86 and should we log an issue to track it?
simdRegToSimdRegMove is to generate better code when the value is lastUse, right?
Correct; if op1Reg is not at its last use, then LSRA should pick a different register for targetReg.
Do we want a similar optimization for x86 and should we log an issue to track it?
Yes, and I agree with @CarolEidt that we need a separate issue to track all of them for .NET 6.
Opened #40489
* Transform Math{F}.FusedMultiplyAdd(x,y,z) into
AdvSimd.FusedMultiplyAddScalar(
Vector64.CreateScalarUnsafe(z),
Vector64.CreateScalarUnsafe(y),
Vector64.CreateScalarUnsafe(x)).ToScalar() on Arm64 in importer.cpp
* Add containment analysis for AdvSimd_FusedMultiplyAddScalar in lower.h lowerarmarch.cpp
* Set tgtPrefOp1 for intrinsics with SIMD-to-SIMD move semantics in lsraarm64.cpp
This PR consists of three parts:

Treating Math.FusedMultiplyAdd and MathF.FusedMultiplyAdd as JIT intrinsics by transforming a call to MathF.FusedMultiplyAdd into

AdvSimd.FusedMultiplyAddScalar(
    Vector64.CreateScalarUnsafe(z),
    Vector64.CreateScalarUnsafe(y),
    Vector64.CreateScalarUnsafe(x)).ToScalar()

and transforming a call to Math.FusedMultiplyAdd into the same AdvSimd.FusedMultiplyAddScalar sequence with Vector64<double> operands.

The second change is not required to resolve #40078 but would be nice to have and is basically an arm64 implementation of the work that @EgorBo did in dotnet/coreclr#27060.
I attached JitDisasm for methods in MathFusedMultiplyAdd_ro before and after the changes.
MathFusedMultiplyAdd_ro-After.txt
MathFusedMultiplyAdd_ro-Before.txt
As an example, the following are two assembly listings for Check5(double, double, double), which corresponds to Math.FusedMultiplyAdd(op1, -op2, op3).

Before:
After:
The third change preferences op1Reg for targetReg of intrinsics that have "simd-to-simd move" semantics. For example, both Vector64.CreateScalarUnsafe(float) and AdvSimd.Arm64.DuplicateToVector64(double) (this is what Vector64.Create(double) gets lowered to) have these semantics. With that change the code for Check5 has one less instruction.

Thanks @CarolEidt for proposing the solution to the issue with the redundant fmov instruction.

The following are JitDisasm for methods in MathFusedMultiplyAdd_ro after all the changes (including the ones to LSRA).
MathFusedMultiplyAdd_ro-After2.txt