Sanitize control/format characters in console logger output across all formatters#128741
Sanitize control/format characters in console logger output across all formatters#128741Copilot wants to merge 10 commits into
Conversation
|
Tagging subscribers to this area: @dotnet/area-extensions-logging |
Co-authored-by: rosebyte <14963300+rosebyte@users.noreply.github.com>
tarekgh
left a comment
There was a problem hiding this comment.
Review of the control character sanitization changes
The security motivation here is solid. Preventing terminal escape injection (ANSI sequences, bidi overrides, etc.) is worth doing. But the current implementation has several problems that need to be addressed before merging.
\n, \r, and \t should not be escaped
The sanitizer uses UnicodeCategory.Control which catches every character in U+0000-U+001F, including \n, \r, and \t. These are not security threats. They are structural formatting characters that the formatters depend on.
Both SimpleConsoleFormatter and SystemdConsoleFormatter have explicit downstream logic that operates on real newlines:
- SimpleConsoleFormatter.WriteMessage calls
message.Replace(Environment.NewLine, _newLineWithMessagePadding)to add indentation padding after each newline in exception text. - SystemdConsoleFormatter.WriteReplacingNewLine calls
message.Replace(Environment.NewLine, " ")to flatten multi-line messages into a single line (required by systemd/journald).
Because the sanitizer runs before these calls, it converts \n to the literal text \u000A. The downstream Replace calls then find no real newlines and become no-ops. This breaks multi-line exception formatting in Simple mode (no padding) and breaks the single-line guarantee in Systemd mode.
The fix should target only the actually dangerous characters: ESC (\x1B), BEL (\x07), backspace (\x08), bidi overrides (\u202E, \u202D), and similar. Not \n/\r/\t.
Double-escaping on the JSON path
Utf8JsonWriter with JavaScriptEncoder.Default already escapes all control characters (U+0000-U+001F) and all non-BasicLatin characters (including \u202E). Pre-sanitizing the strings is redundant and produces double-escaped output.
For example, ESC (\x1B) would normally appear as \u001B in the JSON output. With the sanitizer, it becomes \\u001B, which is a literal backslash followed by u001B. JSON consumers parsing these logs would see the text \u001B instead of the actual ESC character. The test changes in JsonConsoleFormatterTests.cs confirm this: they switched to expecting \\\\u000D\\\\u000A.
The sanitizer should be skipped entirely for the JsonConsoleFormatter path, or at minimum should not run when Utf8JsonWriter is handling the escaping.
Breaking change with default true
Setting SanitizeControlCharacters = true by default changes the output format for every existing application without any opt-in. Exception stack traces go from properly formatted multi-line output to a single blob containing \u000A literals. This will break log parsing tools and dashboards that expect the current format.
Consider either defaulting to false or narrowing the escape set so that \n/\r/\t pass through unchanged (which would make the default safe).
Minor issues
- API review: adding a public property to
ConsoleFormatterOptionsrequires going through the dotnet/runtime API review process. - Allocations: every exception log triggers a
StringBuilderallocation since exception strings always contain\n. Considerstring.CreateorValueStringBuilderfor the hot path. - Test coverage:
Log_ControlCharacters_SanitizationCanBeDisabledonly tests Simple and Systemd formatters. JSON opt-out is not covered. - Existing test expectations modified: the changes to
ConsoleLoggerTest.csnormalize broken formatting as the new expected output rather than preserving the original behavior.
|
A note on JSON output and custom encoders:
However, dotnet's JSON writer supports pluggable encoders, and one of them, Why we should leave it as it for now: • The encoder's name literally contains "Unsafe", so using it is a deliberate opt-in to relaxed behaviour. If this proves to be a real-world concern, we can add targeted sanitisation to the JSON path later without any API change. |
There was a problem hiding this comment.
Pull request overview
This PR introduces a shared sanitizer (ConsoleControlCharacterSanitizer) and applies it to the Simple and Systemd console formatter pipelines to reduce the risk of untrusted control / formatting characters influencing terminal output. It also adds new unit tests intended to validate sanitization behavior for non-JSON formatters.
Changes:
- Added
ConsoleControlCharacterSanitizerand used it to sanitize message / exception / category / scope text inSimpleConsoleFormatterandSystemdConsoleFormatter. - Updated
JsonConsoleFormatter’scharstate-property handling (but Json formatter still does not apply the new sanitizer). - Added tests for sanitization behavior (currently for non-JSON formatters only) and updated minor test code comments.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| src/libraries/Microsoft.Extensions.Logging.Console/tests/Microsoft.Extensions.Logging.Console.Tests/ConsoleLoggerTest.cs | Removes Arrange/Act/Assert comments in an existing test. |
| src/libraries/Microsoft.Extensions.Logging.Console/tests/Microsoft.Extensions.Logging.Console.Tests/ConsoleFormatterTests.cs | Adds new sanitization tests (currently limited to non-JSON formatters). |
| src/libraries/Microsoft.Extensions.Logging.Console/src/SystemdConsoleFormatter.cs | Sanitizes message/exception/category and scope string output prior to writing. |
| src/libraries/Microsoft.Extensions.Logging.Console/src/SimpleConsoleFormatter.cs | Sanitizes message/exception/category and scope string output prior to writing. |
| src/libraries/Microsoft.Extensions.Logging.Console/src/Microsoft.Extensions.Logging.Console.csproj | Adds ValueStringBuilder to support the sanitizer implementation. |
| src/libraries/Microsoft.Extensions.Logging.Console/src/JsonConsoleFormatter.cs | Adjusts JSON state-property writing for char values (allocates) but does not add sanitizer usage. |
| src/libraries/Microsoft.Extensions.Logging.Console/src/ConsoleControlCharacterSanitizer.cs | New sanitizer that escapes a hardcoded set of characters to \\uXXXX. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| >= '\u200B' and <= '\u200F' => true, // zero-width and directional marks | ||
| >= '\u202A' and <= '\u202E' => true, // bidi embedding/override | ||
| >= '\u2066' and <= '\u2069' => true, // bidi isolates | ||
| _ => false, |
There was a problem hiding this comment.
I am curious how did we choose regions? There are many other questionable ranges. For example https://en.wikipedia.org/wiki/Tags_(Unicode_block). I think the more proper way would be to decide based on https://learn.microsoft.com/en-us/dotnet/api/system.char.getunicodecategory?view=net-10.0 but did not check for any edge cases using this approach. I also recommend reading https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction which list many interesting and possibly error prone parts.
There was a problem hiding this comment.
I agree. Can we leverage GetUnicodeCategory without giving up perf using SearchValues? At least for the fast non-sanitizing path? E.g:
// Built once. GetUnicodeCategory cost is paid here, not per log call.
private static readonly SearchValues<char> s_charsToEscape = SearchValues.Create(BuildEscapeSet());
private static char[] BuildEscapeSet()
{
var chars = new List<char>(256);
for (int c = 0; c <= 0xFFFF; c++)
{
if (c is '\t' or '\r' or '\n')
continue;
if (char.GetUnicodeCategory((char)c) is UnicodeCategory.Control or UnicodeCategory.Format)
chars.Add((char)c);
}
return chars.ToArray();
}
public static string Sanitize(string value)
{
int idx = value.AsSpan().IndexOfAny(s_charsToEscape); // vectorized
if (idx < 0)
return value;|
Are we sure that Console Logger is right place to do such sanitization? I think it will negatively impact performance, even in many case when static constant strings are passed and sanitization will happen every time they are logged which seems like big overhead considering MECFG scale. I did not read thread model or any related documents, but to me it seems that this should be responsibility of caller. |
@BrennanConroy, @halter73, do you have any thoughts? Honestly, the argument users are supposed to sanitise the log message because they know the context, makes complete sense. We can still provide a convenience method to help them. |
| sanitized.Append('\\'); | ||
| sanitized.Append('u'); | ||
| int codePoint = current; | ||
| Span<char> hex = sanitized.AppendSpan(4); | ||
| hex[0] = ToHexChar(codePoint >> 12); | ||
| hex[1] = ToHexChar((codePoint >> 8) & 0xF); | ||
| hex[2] = ToHexChar((codePoint >> 4) & 0xF); | ||
| hex[3] = ToHexChar(codePoint & 0xF); |
There was a problem hiding this comment.
Nit: If AppendSpan() is better, should we use it for all 6 characters?
| sanitized.Append('\\'); | |
| sanitized.Append('u'); | |
| int codePoint = current; | |
| Span<char> hex = sanitized.AppendSpan(4); | |
| hex[0] = ToHexChar(codePoint >> 12); | |
| hex[1] = ToHexChar((codePoint >> 8) & 0xF); | |
| hex[2] = ToHexChar((codePoint >> 4) & 0xF); | |
| hex[3] = ToHexChar(codePoint & 0xF); | |
| Span<char> escaped = sanitized.AppendSpan(6); | |
| escaped[0] = '\\'; | |
| escaped[1] = 'u'; | |
| escaped[2] = ToHexChar(current >> 12); | |
| escaped[3] = ToHexChar((current >> 8) & 0xF); | |
| escaped[4] = ToHexChar((current >> 4) & 0xF); | |
| escaped[5] = ToHexChar(current & 0xF); |
| sanitized.Append('\\'); | ||
| sanitized.Append('u'); | ||
| int codePoint = current; | ||
| Span<char> hex = sanitized.AppendSpan(4); | ||
| hex[0] = ToHexChar(codePoint >> 12); | ||
| hex[1] = ToHexChar((codePoint >> 8) & 0xF); | ||
| hex[2] = ToHexChar((codePoint >> 4) & 0xF); | ||
| hex[3] = ToHexChar(codePoint & 0xF); |
There was a problem hiding this comment.
I'm not against this approach, but a simpler alternative would be to replace everything with � a.k.a. '\uFFFD' a.k.a. U+FFFD REPLACEMENT CHARACTER.
There was a problem hiding this comment.
I don't think � is a good replacement. Converting to hex number is way more "expected".
| >= '\u200B' and <= '\u200F' => true, // zero-width and directional marks | ||
| >= '\u202A' and <= '\u202E' => true, // bidi embedding/override | ||
| >= '\u2066' and <= '\u2069' => true, // bidi isolates | ||
| _ => false, |
There was a problem hiding this comment.
Are there no problematic code points outside the BMP (i.e. those that would be composed of two chars)?
|
The logger implementation (Console) knows what safe and unsafe input for its output is, the logging callers (app code) does not know what is safe or unsafe. |
|
Following up on the performance question raised above. Independent of where sanitization ultimately lives, if it stays in the Console logger it runs on the hot path for every The same escape set can be expressed as a static
(Local numbers, not CI hardware, so treat the absolute values as indicative.) The cost is pure CPU per log call, multiplied across the four sanitized values. Suggestions:
This keeps the point that the logger is the right place to know what is unsafe for its output, while removing the per-call overhead. |
| hex[0] = ToHexChar(codePoint >> 12); | ||
| hex[1] = ToHexChar((codePoint >> 8) & 0xF); | ||
| hex[2] = ToHexChar((codePoint >> 4) & 0xF); | ||
| hex[3] = ToHexChar(codePoint & 0xF); |
There was a problem hiding this comment.
Remove the ToHexChar and use HexConverter.ToCharUpper (Common/System/HexConverter.cs).
| hex[0] = ToHexChar(codePoint >> 12); | |
| hex[1] = ToHexChar((codePoint >> 8) & 0xF); | |
| hex[2] = ToHexChar((codePoint >> 4) & 0xF); | |
| hex[3] = ToHexChar(codePoint & 0xF); | |
| hex[0] = HexConverter.ToCharUpper(current >> 12); | |
| hex[1] = HexConverter.ToCharUpper(current >> 8); | |
| hex[2] = HexConverter.ToCharUpper(current >> 4); | |
| hex[3] = HexConverter.ToCharUpper(current); |
You don't need to mask the shifts - ToCharUpper does value &= 0xF internally, so the leftover high bits are discarded for free.
Or if you want to avoid hand-composing the values use ((int)current).TryFormat(hex, out _, "X4").
Fixes #128727
Console logging currently writes untrusted control characters verbatim, allowing terminal escape/control effects and ambiguous output. This change sanitizes control/format characters across
Simple,Systemd, andJsonformatter paths.Behavioral change
Cc/Cfcharacters as\uXXXXbefore writing log output.Implementation
ConsoleControlCharacterSanitizer.Tests