port SpanHelpers.IndexOf(ref byte, byte, int) to Vector128/256 #73364
adamsitnik merged 2 commits into dotnet:main
Tagging subscribers to this area: @dotnet/area-system-memory
Issue Details
x64
10% improvement for AVX2, no regressions for AVX or with hardware intrinsics disabled (no HI).
Details
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-BJEYEU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-KDQZPU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX2
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-SUXHIF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-KVLSYC : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.6.22352.1
[Host] : .NET 7.0.0 (7.0.22.32404), X64 RyuJIT AVX2
Job-KLMBAP : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-KSWPBM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
EnvironmentVariables=COMPlus_EnableHWIntrinsic=0
arm64
The initial implementation took a 2x perf hit, but after I moved the call to ExtractMostSignificantBits so that it is performed only when a match is found, the perf is on par (see the second commit). BDN reports a difference of half a nanosecond, which translates to 2-3%, and I think we should just ignore it (it's within the range of error).
Details
BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22403.8
[Host] : .NET 7.0.0 (7.0.22.40210), Arm64 RyuJIT AdvSIMD
Job-SUFTPF : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-WFBVQA : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
contributes to #64451
    }

    // Find bitflag offset of first match and add to current offset
    uint matches = compareResult.ExtractMostSignificantBits();
Last time I tried to do the same, it was a noticeable regression for ARM64.

Now this operation is performed only once, after we find a match (not once for every vector we compare).

I think I did it too 🙂 But maybe it was before I did #65632.

What is the cost of this approach vs. doing it on every loop iteration for Vector256<T>? Is it better, particularly for large inputs, to do this there as well?
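
For readers following this thread, here is a minimal sketch of the pattern under discussion: the loop only asks whether any lane matched, and ExtractMostSignificantBits runs once, after a match is known to exist. It is not the code from this PR. The class and method names (DeferredMaskSketch, IndexOfByte) are hypothetical, and the sketch assumes .NET 7 or later, uses only Vector128 over a ReadOnlySpan<byte>, and finishes the tail with a plain scalar loop.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical illustration only; not the SpanHelpers implementation.
public static class DeferredMaskSketch
{
    public static int IndexOfByte(ReadOnlySpan<byte> data, byte value)
    {
        int i = 0;

        if (Vector128.IsHardwareAccelerated && data.Length >= Vector128<byte>.Count)
        {
            Vector128<byte> target = Vector128.Create(value);
            int lastVectorStart = data.Length - Vector128<byte>.Count;

            for (; i <= lastVectorStart; i += Vector128<byte>.Count)
            {
                Vector128<byte> chunk = Vector128.Create(data.Slice(i, Vector128<byte>.Count));
                Vector128<byte> compareResult = Vector128.Equals(target, chunk);

                // Miss path: only test for "no lane matched"; no bitmask is built here.
                if (compareResult == Vector128<byte>.Zero)
                {
                    continue;
                }

                // Hit path: build the bitmask once and take the lowest set bit,
                // which corresponds to the first matching byte in this chunk.
                uint matches = compareResult.ExtractMostSignificantBits();
                return i + BitOperations.TrailingZeroCount(matches);
            }
        }

        // Scalar fallback for short inputs and for the remaining tail.
        for (; i < data.Length; i++)
        {
            if (data[i] == value)
            {
                return i;
            }
        }

        return -1;
    }
}
```

The miss path is where long inputs spend most of their time, so keeping the mask extraction out of it is the point of the change described above. Whether the same trade-off pays off for Vector256<T> is the open question in this thread, and the sketch does not answer it.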
x64
10% improvement for AVX2, no regressions for AVX or with hardware intrinsics disabled (no HI).
arm64
The initial implementation took a 2x perf hit, but after I moved the call to ExtractMostSignificantBits so that it is performed only when a match is found, the perf is on par (see the second commit).
BDN reports a difference of half a nanosecond, which translates to 2-3%, and I think we should just ignore it (it's within the range of error).
contributes to #64451