Commit 90908d5

Changing the logic for how we deal with RegexOptions.IgnoreCase matching. (#67184)
* Changing the logic for how we deal with RegexOptions.IgnoreCase matching.
* Addressing first round of feedback
* Addressing more feedback.
* Ensure that Backreferences use the same case behavior that the casing table does when using IgnoreCase.
* Addressing more feedback.
* Apply suggestions from code review
* Address more feedback
* Fix allocation regression for patterns with a lot of ASCII letters
* Skip a few tests in Browser and .NET Framework
* Skip one more test that shouldn't be run on wasm
* Address more PR Feedback
* More feedback
* Skip tests that are failing in NLS-globalization queues

Co-authored-by: Stephen Toub <stoub@microsoft.com>
1 parent b4c76da commit 90908d5

44 files changed

Lines changed: 2281 additions & 1800 deletions

src/libraries/System.Private.CoreLib/Tools/GenUnicodeProp/Updating-Unicode-Versions.md

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,7 @@
 # Instructions for updating Unicode version in dotnet/runtime
 
 ## Table of Contents
+
 - [Instructions for updating Unicode version in dotnet/runtime](#instructions-for-updating-unicode-version-in-dotnetruntime)
 - [Table of Contents](#table-of-contents)
 - [Overview](#overview)
@@ -24,8 +25,7 @@ This repository has several places that need to be updated when we are ingesting
 - extracted/DerivedBidiClass.txt
 - extracted/DerivedName.txt
 
-2. Once you have downloaded all those files, create a fork of the repo https://github.com/dotnet/runtime-assets and send a PR which creates a folder at `src/System.Private.Runtime.UnicodeData/<YourUnicodeVersion>` and places all of the downloaded files from step 1 there. You can look at a sample PR that did this for Unicode 14.0.0 here: https://github.com/dotnet/runtime-assets/pull/179
-
+2. Once you have downloaded all those files, create a fork of the repo <https://github.com/dotnet/runtime-assets> and send a PR which creates a folder at `src/System.Private.Runtime.UnicodeData/<YourUnicodeVersion>` and places all of the downloaded files from step 1 there. You can look at a sample PR that did this for Unicode 14.0.0 here: <https://github.com/dotnet/runtime-assets/pull/179>
 
 ## Ingest the created package into dotnet/runtime repo
 
@@ -42,6 +42,6 @@ This should be done automatically by dependency-flow, so in theory there shouldn
 - System.Globalization.Nls.Tests.csproj
 - System.Text.Encodings.Web.Tests.csproj
 4. If the new Unicode data contains casing changes/updates, then we will also need to update the `src/coreclr/pal/src/locale/unicodedata.cpp` file. This file is used by most of the reflection stack whenever you specify the `BindingFlags.IgnoreCase`. In order to regenerate the contents of the `unicodedata.cpp` file, you need to run the program located at `src/coreclr/pal/src/locale/unicodedata.cs` and give a full path to the new UnicodeData.txt as a parameter.
-5. If the new Unicode data made changes on what character class a specific character belongs to, or added new characters, you may need to update the serialized Unicode character classes data in `System.Text.RegularExpressions` for the `NonBacktracking` engine. The telling sign that will show you if you need to do this, is if any tests are failing in the `System.Text.RegularExpressions.Tests` test project. In case some tests do fail (which means you need to update the serialized mappings), you will need to edit the file `src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexExperiment.cs` and set the `Enabled` bool to `true`, and re-run the RegexTests. This will generate a couple of files in your `%temp%` directory: `IgnoreCaseRelation.cs` and `UnicodeCategoryRanges.cs`. These files will need to be copied (and overwrite the existing ones) to the folder `src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Symbolic/Unicode/`
+5. Update the Regex casing equivalence table using the UnicodeData.txt file from the new Unicode version. You can find the instructions on how to do this [here](../../../System.Text.RegularExpressions/tools/Readme.md).
 6. Finally, the last step is to update the license for the Unicode data in our [Third party notices](../../../../../THIRD-PARTY-NOTICES.TXT) by copying the contents located in `https://www.unicode.org/license.html` to the section that has the Unicode license in our notices.
 7. That's it, now commit all of the changed files, and send a PR into dotnet/runtime with the updates. If there were any special things you had to do that are not noted in this document, PLEASE UPDATE THESE INSTRUCTIONS to facilitate future updates.

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

Lines changed: 94 additions & 188 deletions
Large diffs are not rendered by default.

src/libraries/System.Text.RegularExpressions/gen/System.Text.RegularExpressions.Generator.csproj

Lines changed: 3 additions & 1 deletion
@@ -33,8 +33,10 @@
     <Compile Include="$(CoreLibSharedDir)System\Collections\Generic\ValueListBuilder.cs" Link="Production\ValueListBuilder.cs" />
     <Compile Include="..\src\System\Collections\Generic\ValueListBuilder.Pop.cs" Link="Production\ValueListBuilder.Pop.cs" />
     <Compile Include="..\src\System\Threading\StackHelper.cs" Link="Production\StackHelper.cs" />
+    <Compile Include="..\src\System\Text\RegularExpressions\RegexCaseEquivalences.Data.cs" Link="Production\RegexCaseEquivalences.Data.cs" />
+    <Compile Include="..\src\System\Text\RegularExpressions\RegexCaseEquivalences.cs" Link="Production\RegexCaseEquivalences.cs" />
+    <Compile Include="..\src\System\Text\RegularExpressions\RegexCaseBehavior.cs" Link="Production\RegexCaseBehavior.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.cs" Link="Production\RegexCharClass.cs" />
-    <Compile Include="..\src\System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" Link="Production\RegexCharClass.MappingTable.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexFindOptimizations.cs" Link="Production\RegexFindOptimizations.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexNode.cs" Link="Production\RegexNode.cs" />
     <Compile Include="..\src\System\Text\RegularExpressions\RegexNodeKind.cs" Link="Production\RegexNodeKind.cs" />

src/libraries/System.Text.RegularExpressions/src/System.Text.RegularExpressions.csproj

Lines changed: 3 additions & 5 deletions
@@ -24,8 +24,10 @@
     <Compile Include="System\Text\RegularExpressions\Regex.Replace.cs" />
     <Compile Include="System\Text\RegularExpressions\Regex.Split.cs" />
     <Compile Include="System\Text\RegularExpressions\Regex.Timeout.cs" />
+    <Compile Include="System\Text\RegularExpressions\RegexCaseBehavior.cs" />
+    <Compile Include="System\Text\RegularExpressions\RegexCaseEquivalences.Data.cs" />
+    <Compile Include="System\Text\RegularExpressions\RegexCaseEquivalences.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCharClass.cs" />
-    <Compile Include="System\Text\RegularExpressions\RegexCharClass.MappingTable.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexCompilationInfo.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexFindOptimizations.cs" />
     <Compile Include="System\Text\RegularExpressions\RegexGeneratorAttribute.cs" />
@@ -83,10 +85,6 @@
     <Compile Include="System\Text\RegularExpressions\Symbolic\SymbolicRegexSet.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\TransitionRegex.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\TransitionRegexKind.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\GeneratorHelper.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseRelation.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseRelationGenerator.cs" />
-    <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\IgnoreCaseTransformer.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryRanges.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryRangesGenerator.cs" />
     <Compile Include="System\Text\RegularExpressions\Symbolic\Unicode\UnicodeCategoryTheory.cs" />

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/CompiledRegexRunner.cs

Lines changed: 6 additions & 1 deletion
@@ -1,17 +1,22 @@
 // Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.
 
+using System.Globalization;
+
 namespace System.Text.RegularExpressions
 {
     internal sealed class CompiledRegexRunner : RegexRunner
     {
         private readonly ScanDelegate _scanMethod;
+        /// <summary>This field will only be set if the pattern contains backreferences and has RegexOptions.IgnoreCase</summary>
+        private readonly TextInfo? _textInfo;
 
         internal delegate void ScanDelegate(RegexRunner runner, ReadOnlySpan<char> text);
 
-        public CompiledRegexRunner(ScanDelegate scan)
+        public CompiledRegexRunner(ScanDelegate scan, CultureInfo? culture)
         {
             _scanMethod = scan;
+            _textInfo = culture?.TextInfo;
         }
 
         protected internal override void Scan(ReadOnlySpan<char> text)
Lines changed: 6 additions & 2 deletions
@@ -1,24 +1,28 @@
 // Licensed to the .NET Foundation under one or more agreements.
 // The .NET Foundation licenses this file to you under the MIT license.
 
+using System.Globalization;
 using System.Reflection.Emit;
 
 namespace System.Text.RegularExpressions
 {
     internal sealed class CompiledRegexRunnerFactory : RegexRunnerFactory
     {
         private readonly DynamicMethod _scanMethod;
+        /// <summary>This field will only be set if the pattern has backreferences and uses RegexOptions.IgnoreCase</summary>
+        private readonly CultureInfo? _culture;
 
         // Delegate is lazily created to avoid forcing JIT'ing until the regex is actually executed.
         private CompiledRegexRunner.ScanDelegate? _scan;
 
-        public CompiledRegexRunnerFactory(DynamicMethod scanMethod)
+        public CompiledRegexRunnerFactory(DynamicMethod scanMethod, CultureInfo? culture)
         {
             _scanMethod = scanMethod;
+            _culture = culture;
         }
 
         protected internal override RegexRunner CreateInstance() =>
             new CompiledRegexRunner(
-                _scan ??= _scanMethod.CreateDelegate<CompiledRegexRunner.ScanDelegate>());
+                _scan ??= _scanMethod.CreateDelegate<CompiledRegexRunner.ScanDelegate>(), _culture);
     }
 }
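
The nullable culture threaded through the factory and runner above is only populated for IgnoreCase patterns that contain backreferences. As a rough illustration of why a runner might hold on to a `TextInfo`, here is a minimal sketch of a per-character backreference comparison. `BackreferenceSketch` and `EqualsIgnoreCase` are hypothetical names that do not appear in the commit, and the fallback to the invariant culture is an assumption made only so the sketch is self-contained.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not taken from the commit. It compares a captured group
// against a candidate span character by character using
// a.ToLower() == b.ToLower(), the rule the commit describes for backreferences.
static class BackreferenceSketch
{
    public static bool EqualsIgnoreCase(ReadOnlySpan<char> captured, ReadOnlySpan<char> candidate, CultureInfo? culture)
    {
        // Same `culture?.TextInfo` pattern as the runner's constructor. Falling
        // back to the invariant culture is an assumption of this sketch.
        TextInfo textInfo = culture?.TextInfo ?? CultureInfo.InvariantCulture.TextInfo;

        if (captured.Length != candidate.Length)
        {
            return false;
        }

        for (int i = 0; i < captured.Length; i++)
        {
            // Lowercase both sides with the same culture before comparing.
            if (textInfo.ToLower(captured[i]) != textInfo.ToLower(candidate[i]))
            {
                return false;
            }
        }

        return true;
    }
}
```

Capturing the culture once at construction keeps the comparison consistent for the lifetime of the runner, even if the thread's current culture changes afterwards.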

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.Debug.cs

Lines changed: 1 addition & 2 deletions
@@ -44,13 +44,12 @@ internal void SaveDGML(TextWriter writer, bool nfa, bool addDotStar, bool revers
         }
 
         /// <summary>
-        /// Generates two files IgnoreCaseRelation.cs and UnicodeCategoryRanges.cs for the namespace System.Text.RegularExpressions.Symbolic.Unicode
+        /// Generates UnicodeCategoryRanges.cs for the namespace System.Text.RegularExpressions.Symbolic.Unicode
         /// in the given directory path. Only available in DEBUG mode.
         /// </summary>
         [ExcludeFromCodeCoverage(Justification = "Debug only")]
         internal static void GenerateUnicodeTables(string path)
         {
-            IgnoreCaseRelationGenerator.Generate("System.Text.RegularExpressions.Symbolic.Unicode", "IgnoreCaseRelation", path);
             UnicodeCategoryRangesGenerator.Generate("System.Text.RegularExpressions.Symbolic.Unicode", "UnicodeCategoryRanges", path);
         }
 

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/Regex.cs

Lines changed: 2 additions & 2 deletions
@@ -67,7 +67,7 @@ internal Regex(string pattern, CultureInfo? culture)
             RegexTree tree = Init(pattern, RegexOptions.None, s_defaultMatchTimeout, ref culture);
 
             // Create the interpreter factory.
-            factory = new RegexInterpreterFactory(tree, culture);
+            factory = new RegexInterpreterFactory(tree);
 
             // NOTE: This overload _does not_ delegate to the one that takes options, in order
             // to avoid unnecessarily rooting the support for RegexOptions.NonBacktracking/Compiler
@@ -101,7 +101,7 @@ internal Regex(string pattern, RegexOptions options, TimeSpan matchTimeout, Cult
                 }
 
                 // If no factory was created, fall back to creating one for the interpreter.
-                factory ??= new RegexInterpreterFactory(tree, culture);
+                factory ??= new RegexInterpreterFactory(tree);
             }
         }
 
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
+
+using System.Globalization;
+
+namespace System.Text.RegularExpressions
+{
+    /// <summary>
+    /// When a regular expression specifies the option <see cref="RegexOptions.IgnoreCase"/> then comparisons between the input and the
+    /// pattern will be made case-insensitively. In order to support this, we need to define which case mappings shall be used for the comparisons.
+    /// A case mapping exists whenever you have two characters 'A' and 'B', where either 'A' is the ToLower() representation of 'B' or both 'A' and 'B' lowercase to the
+    /// same character. Note that we don't consider a mapping when the only relationship between 'A' and 'B' is that one is the ToUpper() representation of the other. This
+    /// is for backwards compatibility since, in Regex, we have only considered ToLower() for case-insensitive comparisons. Given that the case mappings vary depending on the culture,
+    /// Regex supports 3 main behaviors or mappings: Invariant, NonTurkish, and Turkish. This is in order to match the current ToLower() behavior of all
+    /// .NET supported cultures. As a side note, there should be no cases where 'A'.ToLower() == 'B' but 'A'.ToLower() != 'B'.ToLower(). This aspect is important since
+    /// for backreferences we make use of a.ToLower() == b.ToLower() for comparisons, so if there were such a case it would lead to inconsistencies between how we handle
+    /// backreferences vs how we handle other case-insensitive comparisons.
+    /// </summary>
+    internal enum RegexCaseBehavior
+    {
+        /// <summary>
+        /// Invariant case-mappings are used. This includes all of the common mappings across cultures. This behavior is used when either the user
+        /// specified <see cref="RegexOptions.CultureInvariant"/> or when the CurrentCulture is <see cref="CultureInfo.InvariantCulture"/>.
+        /// </summary>
+        Invariant,
+
+        /// <summary>
+        /// These are all the same mappings used by Invariant behavior, with an additional one: \u0130 => \u0069
+        /// This mode will be used when CurrentCulture is neither Invariant nor any of the tr/az cultures.
+        /// </summary>
+        NonTurkish,
+
+        /// <summary>
+        /// These are all the same mappings used by non-Turkish behavior, with the exception of \u0049 => \u0069, which doesn't exist
+        /// in this behavior, and with the additional mapping \u0069 => \u0131. This mode will be used when CurrentCulture is any of the tr/az cultures.
+        /// </summary>
+        Turkish
+    }
+}
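
The summary comments above describe when each of the three behaviors applies. The selection rule could be sketched roughly as follows. `CaseBehaviorSketch` and `CaseBehaviorSelector.Select` are hypothetical names introduced for illustration, and matching on the culture name prefix is an assumption of this sketch, not the commit's actual lookup.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical stand-in for the internal RegexCaseBehavior enum, declared here
// only so that the sketch compiles on its own.
enum CaseBehaviorSketch { Invariant, NonTurkish, Turkish }

static class CaseBehaviorSelector
{
    // Chooses a behavior from the options and culture, following the rules
    // stated in the enum's doc comments.
    public static CaseBehaviorSketch Select(RegexOptions options, CultureInfo culture)
    {
        // CultureInvariant, or the invariant culture itself (whose Name is the
        // empty string), selects the Invariant mappings.
        if ((options & RegexOptions.CultureInvariant) != 0 || culture.Name.Length == 0)
        {
            return CaseBehaviorSketch.Invariant;
        }

        // Assumption: a tr/az culture is recognized by its name being "tr" or
        // "az", or starting with "tr-" or "az-".
        string name = culture.Name;
        bool isTurkishOrAzeri =
            name == "tr" || name == "az" ||
            name.StartsWith("tr-", StringComparison.Ordinal) ||
            name.StartsWith("az-", StringComparison.Ordinal);

        return isTurkishOrAzeri ? CaseBehaviorSketch.Turkish : CaseBehaviorSketch.NonTurkish;
    }
}
```

For example, `Select(RegexOptions.IgnoreCase, new CultureInfo("tr-TR"))` falls into the Turkish branch, while adding `RegexOptions.CultureInvariant` to the options selects Invariant regardless of the culture passed in.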
