arm64: Optimize ARM64 compare mask ExtractMostSignificantBits consumers#129688
arm64: Optimize ARM64 compare mask ExtractMostSignificantBits consumers#129688jonathandavies-arm wants to merge 3 commits into
Conversation
|
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
|
Isn't it the same as #126790 ? |
@tannergooding Optimized all We still have a few places where we call Compare + EMSB instead of explicit IndexOf APIs, that is what my other PR tries to do: #126841 There will be a few places where we won't be able to replaces EMSB with IndexOf (e.g. iterators over multiple matches), though. But I don't think this PR does a good job for them either judging by the diffs - Basically, +600 LOC of JIT changes for 1 use-case (context) in benchmarks. |
|
If you want to make this transformation more useful, you need to borrow the assertionprop.cpp changes from #126790 - |
Teach assertion propagation on ARM64 to recognize Vector64/128 ExtractMostSignificantBits inputs whose value numbers represent per-element boolean masks. The helper recognizes comparison masks, all-zero/all-bits constants, boolean-preserving operations, and reaching PHI values, and marks the EMSB node with a HW intrinsic flag. Consume the flag in rationalization so existing ExtractMostSignificantBits rewrites can handle mask values that have flowed through locals, while keeping unsupported element types filtered out. Add coverage for a comparison mask stored in a local before PopCount(ExtractMostSignificantBits()).
|
A lot bigger diffs now. |
| case GT_LE: | ||
| case GT_LT: | ||
| { | ||
| return varTypeIsIntegral(baseType) && (genTypeSize(baseType) <= genTypeSize(intrinsicSimdBaseType)); |
There was a problem hiding this comment.
I'd leave a comment for <=, the original impl had it.
|
|
||
| if (comp->vnStore->VNVisitReachingVNs(op1VN, vnVisitor) == ValueNumStore::VNVisit::Continue) | ||
| { | ||
| tree->gtFlags |= GTF_HW_ZERO_OR_ALL_BITS_SET; |
There was a problem hiding this comment.
You need to fix GenTree::Compare - it should take this flag into account and return false if the rest is the same.
| // optAssertionProp_HWIntrinsic: Propagate VN-derived facts to hwintrinsic tree flags. | ||
| // | ||
| // Arguments: | ||
| // tree - The hwintrinsic node |
| // Return Value: | ||
| // True if the hwintrinsic produces a mask where each SIMD element is either all-bits-set or zero. | ||
| // | ||
| static bool IsHWIntrinsicCmpMask(NamedIntrinsic intrinsic) |
There was a problem hiding this comment.
should be shared with assertionprop
| // Return Value: | ||
| // The normalized SIMD base type, or TYP_UNDEF if it is unsupported. | ||
| // | ||
| static var_types NormalizeCmpMaskSimdBaseType(var_types simdBaseType) |
There was a problem hiding this comment.
Yeah. There's a getIndexTypeForShuffle. I think it can be renamed to a more general term if we are going to reuse it here.
| node = *useEdge; | ||
| } | ||
| } | ||
| else if (node->AsIntrinsic()->gtIntrinsicName == NI_PRIMITIVE_TrailingZeroCount) |
There was a problem hiding this comment.
Do we need to count LeadingZeroCount as well? (e.g. last index of I assume)

See discussion at #121981 (comment)
SuperPMI ASM diffs: ExtractMostSignificantBits
Base JIT:
artifacts/asmdiff/builds/59979e64/core_root/libclrjit.soDiff JIT:
artifacts/tests/coreclr/linux.arm64.Checked/Tests/Core_Root/libclrjit.soMCH:
/tmp/ExtractMostSignificantBits_final.mchBase commit:
59979e6401a(origin/main)Diff commit:
e37c3840564Short summary
Diffs are based on 39 contexts (0 MinOpts, 39 FullOpts).
Overall (-480 bytes)
FullOpts (-480 bytes)
SuperPMI summary
Diffs are based on 39 contexts (0 MinOpts, 39 FullOpts).
Overall (-480 bytes)
FullOpts (-480 bytes)
Example diffs
ExtractMostSignificantBits_final.mch
-44 (-52.38%) : 17.dasm - TestExtractMostSignificantBits.Program:CountGreaterThanOrEqualByte(System.Runtime.Intrinsics.Vector128`1[byte],byte):int (FullOpts)
-32 (-47.06%) : 12.dasm - TestExtractMostSignificantBits.Program:CountLessThanInt32(System.Runtime.Intrinsics.Vector128`1[int],int):int (FullOpts)
-32 (-47.06%) : 27.dasm - TestExtractMostSignificantBits.Program:CountLessThanInt3264(System.Runtime.Intrinsics.Vector64`1[int],int):int (FullOpts)
-8 (-13.33%) : 33.dasm - TestExtractMostSignificantBits.Program:IndexOfFirstGreaterThanOrEqualByte64(System.Runtime.Intrinsics.Vector64`1[byte],byte):int (FullOpts)
-8 (-13.33%) : 23.dasm - TestExtractMostSignificantBits.Program:IndexOfFirstGreaterThanOrEqualUInt1664(System.Runtime.Intrinsics.Vector64`1[ushort],ushort):int (FullOpts)
-8 (-13.33%) : 8.dasm - TestExtractMostSignificantBits.Program:IndexOfFirstGreaterThanOrEqualUInt16(System.Runtime.Intrinsics.Vector128`1[ushort],ushort):int (FullOpts)
Details
Size improvements/regressions per collection
PerfScore improvements/regressions per collection
Context information
jit-analyze output
ExtractMostSignificantBits_final.mch
Detail diffs
Jit-analyze summary
Detail diffs
Generated artifacts
artifacts/spmi/superpmi.27.logartifacts/spmi/diff_short_summary.11.mdartifacts/spmi/diff_summary.11.mdartifacts/spmi/asm.ExtractMostSignificantBits_final/summary.mdartifacts/spmi/asm.ExtractMostSignificantBits_final/base/artifacts/spmi/asm.ExtractMostSignificantBits_final/diff/