Simplify UTF-16 validation Vector128 codepath by ylpoonlg · Pull Request #121981 · dotnet/runtime

ylpoonlg · 2025-11-26T10:12:44Z

Re-attempt at #121383.

Refactor the vectorized code path by combining the SSE2 "intrinsified" path with the original Vector128 algorithm. There is still some platform-specific code (for AdvSimd), as it is difficult to rely fully on Vector128 APIs without sacrificing too much performance. The main issue is that Arm has no instruction for Vector128.ExtractMostSignificantBits, so forcing the AdvSimd path to use the same mask format as the SSE2 algorithm is significantly slower. I have looked into using IndexOf, Count, etc., but they also use ExtractMostSignificantBits, so they have the same problem.
This PR tries to encapsulate this difference in a few helper methods so that both platforms can share the same code path for the main algorithm.
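
To illustrate the kind of helper described above, here is a minimal sketch of hiding the mask-format difference behind one method. It is not the code from this PR: the class name `SurrogateMaskSketch`, the method `CountSurrogates`, and the choice of `AdvSimd.ExtractNarrowingLower` for the Arm branch are assumptions made for illustration only, and the sketch assumes .NET 7 or later.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class SurrogateMaskSketch
{
    // Hypothetical helper (not the PR's implementation): counts the UTF-16
    // surrogate code units among the 8 chars held in one Vector128<ushort>.
    public static int CountSurrogates(Vector128<ushort> chars)
    {
        // A code unit is a surrogate iff (c & 0xF800) == 0xD800.
        // Each matching lane becomes 0xFFFF, every other lane 0x0000.
        Vector128<ushort> isSurrogate = Vector128.Equals(
            chars & Vector128.Create((ushort)0xF800),
            Vector128.Create((ushort)0xD800));

        if (AdvSimd.IsSupported)
        {
            // Arm: narrow each 16-bit lane to one byte (0x00 or 0xFF), which
            // yields a 64-bit mask with 8 bits per char instead of 1 bit per char.
            ulong mask = AdvSimd.ExtractNarrowingLower(isSurrogate).AsUInt64().ToScalar();
            return BitOperations.PopCount(mask) / 8;
        }

        // Other platforms: one bit per lane, the movemask-style format.
        uint bits = Vector128.ExtractMostSignificantBits(isSurrogate);
        return BitOperations.PopCount(bits);
    }
}
```

Callers see the same result on both platforms; only the internal mask layout differs. For example, a vector holding `'a'`, `0xD83D`, `0xDE00` and five more ASCII chars contains two surrogate code units, so `CountSurrogates` returns 2.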

Performance-wise, there is not much improvement, but hopefully the code will be easier to maintain.

Arm Neoverse-V2:

Method	Input	Version	Mean	Error	Ratio
GetByteCount	EnglishAllAscii	Before	4.437 us	0.0437 us	1.000
GetByteCount	EnglishAllAscii	After	4.475 us	0.1618 us	1.009
GetByteCount	EnglishMostlyAscii	Before	20.387 us	0.1744 us	1.000
GetByteCount	EnglishMostlyAscii	After	19.941 us	0.1079 us	0.978
GetByteCount	Chinese	Before	9.145 us	0.0072 us	1.000
GetByteCount	Chinese	After	8.992 us	0.0069 us	0.983
GetByteCount	Cyrillic	Before	7.936 us	0.0095 us	1.000
GetByteCount	Cyrillic	After	7.812 us	0.0056 us	0.984
GetByteCount	Greek	Before	10.077 us	0.0106 us	1.000
GetByteCount	Greek	After	9.952 us	0.0120 us	0.988

Intel Sapphire Rapids:

Method	Input	Version	Mean	Error	Ratio
GetByteCount	EnglishAllAscii	Before	8.144 us	0.3398 us	1.000
GetByteCount	EnglishAllAscii	After	8.126 us	0.2759 us	0.998
GetByteCount	EnglishMostlyAscii	Before	22.971 us	0.4046 us	1.000
GetByteCount	EnglishMostlyAscii	After	22.155 us	0.9902 us	0.964
GetByteCount	Chinese	Before	10.582 us	0.3425 us	1.000
GetByteCount	Chinese	After	10.048 us	0.2135 us	0.950
GetByteCount	Cyrillic	Before	9.222 us	0.1874 us	1.000
GetByteCount	Cyrillic	After	9.100 us	0.2704 us	0.987
GetByteCount	Greek	Before	11.802 us	0.3551 us	1.000
GetByteCount	Greek	After	11.224 us	0.3505 us	0.951

Combine the SSE2 codepath with a more generic Vector128 algorithm. AdvSimd is handled slightly differently to avoid using Vector128.ExtractMostSignificantBits: there is no equivalent instruction on Arm, so it would otherwise be very slow.

ylpoonlg · 2025-11-26T10:15:08Z

cc @dotnet/arm64-contrib @a74nh @SwapnilGaikwad @tannergooding @EgorBo

EgorBo · 2026-01-13T17:24:55Z

Improvements: dotnet/perf-autofiling-issues#67360

dotnet-policy-service Bot added the community-contribution label Nov 26, 2025

github-actions Bot added the needs-area-label label Nov 26, 2025

ylpoonlg marked this pull request as ready for review November 26, 2025 10:15

EgorBo reviewed Nov 26, 2025

Comment thread src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs

tannergooding approved these changes Dec 2, 2025

tannergooding merged commit ac7db14 into dotnet:main Dec 2, 2025
144 checks passed

dotnet-maestro Bot mentioned this pull request Dec 3, 2025

[main] Source code updates from dotnet/runtime dotnet/dotnet#3448

Merged

github-actions Bot locked and limited conversation to collaborators Jan 2, 2026

tannergooding merged 1 commit into dotnet:main from ylpoonlg:github-utf16-validation

4 participants