Skip to content

feat(string): add UTF-8 string conversion and validation functions#2528

Open
bobtista wants to merge 1 commit intoTheSuperHackers:mainfrom
bobtista:bobtista/feat/utf8-string-functions
Open

feat(string): add UTF-8 string conversion and validation functions#2528
bobtista wants to merge 1 commit intoTheSuperHackers:mainfrom
bobtista:bobtista/feat/utf8-string-functions

Conversation

@bobtista
Copy link
Copy Markdown

@bobtista bobtista commented Apr 3, 2026

Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.

Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: WWLib/utf8.h / utf8.cpp

  • Utf8_Num_Bytes(char lead) — byte count of a UTF-8 character from its lead byte
  • Utf8_Trailing_Invalid_Bytes(const char* str, size_t length) — count of invalid trailing bytes due to an incomplete multi-byte sequence
  • Utf8_Validate(const char* str) / Utf8_Validate(const char* str, size_t length) — returns true if the string is valid UTF-8
  • Get_Utf8_Size(const wchar_t* src) / Get_Wchar_Size(const char* src) — required buffer sizes including null terminator
  • Wchar_To_Utf8(char* dest, const wchar_t* src, size_t dest_size)
  • Utf8_To_Wchar(wchar_t* dest, const char* src, size_t dest_size)

Naming follows the Snake_Case convention used in WWVegas. Arguments are ordered dest, src matching memcpy convention. Implementation wraps Win32 WideCharToMultiByte / MultiByteToWideChar.

AsciiString::translate / UnicodeString::translate

Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

ThreadUtils.cpp

Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions. Also removes the manual dest[len] = 0 null terminator write, which was writing at the wrong position for multi-byte UTF-8 input.

@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Apr 3, 2026

Greptile Summary

This PR replaces the broken 7-bit-only AsciiString::translate() and UnicodeString::translate() implementations with proper UTF-8 conversion via a new WWLib/utf8.h library, and also cleans up raw Win32 API calls in ThreadUtils.cpp. The previous review concern about overlong encodings has been addressed — the validator now explicitly rejects \xC0/\xC1, \xE0\x{<0xA0}, and \xF0\x{<0x90} sequences per RFC 3629. The ensureUniqueBufferOfSize null-terminator fix (moving peek()[usableNumChars] = 0 outside the if (strToCopy) block) is a correct and meaningful improvement.

Confidence Score: 5/5

Safe to merge — all P1 concerns from prior review rounds are addressed; remaining findings are P2 style improvements.

Both open findings are P2: the surrogate gap in the validator is mitigated by downstream Win32 API rejection, and the CMakeLists platform guard is consistent with how thread.cpp is handled and doesn't affect the Windows-only CI. The core correctness of translate(), the ensureUniqueBufferOfSize fix, and the ThreadUtils refactor is sound.

utf8.cpp (surrogate range) and CMakeLists.txt (platform guard) are worth a follow-up but are not blockers.

Important Files Changed

Filename Overview
Core/Libraries/Source/WWVegas/WWLib/utf8.h New header with correct #pragma once guard, clear null-terminator contract in comments, and consistent Snake_Case naming
Core/Libraries/Source/WWVegas/WWLib/utf8.cpp Win32-backed UTF-8 implementation; rejects overlong encodings correctly but surrogate codepoints U+D800–U+DFFF pass the validator
Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt utf8.cpp added to unconditional source list despite being Windows-only; should be guarded by if(WIN32)
Core/GameEngine/Source/Common/System/AsciiString.cpp translate() upgraded from 7-bit ASCII loop to proper UTF-8 via WWLib helpers; ensureUniqueBufferOfSize null-terminator fix is correct
Core/GameEngine/Source/Common/System/UnicodeString.cpp Symmetric UTF-8 upgrade for UnicodeString::translate(); ensureUniqueBufferOfSize fix applied correctly on the wide-char side
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp MultiByteToWideCharSingleLine and WideCharStringToMultiByte refactored to WWLib helpers; std::wstring manages null termination correctly so the removed manual write was the right call

Sequence Diagram

sequenceDiagram
    participant Caller
    participant AsciiString
    participant utf8lib as WWLib/utf8
    participant Win32

    Caller->>AsciiString: translate(UnicodeString)
    AsciiString->>utf8lib: Get_Utf8_Size(src, srcLen)
    utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, query size)
    Win32-->>utf8lib: byte count
    utf8lib-->>AsciiString: size
    AsciiString->>AsciiString: ensureUniqueBufferOfSize(size+1)<br/>sets buf[size]=0
    AsciiString->>utf8lib: Unicode_To_Utf8(buf, src, srcLen, size)
    utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, write)
    Win32-->>utf8lib: result
    utf8lib-->>AsciiString: true / false
    AsciiString-->>Caller: UTF-8 string (or clear() on failure)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 83-95

Comment:
**Surrogate range not rejected per RFC 3629**

`Utf8_Validate` correctly handles overlong encodings (the comment even cites RFC 3629), but 3-byte sequences encoding surrogates U+D800–U+DFFF (`0xED 0xA0–0xBF 0x80–0xBF`) still pass. RFC 3629 §3 explicitly forbids encoding surrogates in UTF-8. The downstream `MultiByteToWideChar` will reject them at conversion time, so this isn't an immediate security hole, but the validator's stated contract is incomplete. A one-line guard closes the gap:

```suggestion
		// Reject overlong encodings per RFC 3629
		if (bytes == 2 && s[i] < 0xC2)
			return false;
		if (bytes == 3 && s[i] == 0xE0 && s[i + 1] < 0xA0)
			return false;
		// Reject surrogates (U+D800-U+DFFF) per RFC 3629
		if (bytes == 3 && s[i] == 0xED && s[i + 1] >= 0xA0)
			return false;
		if (bytes == 4 && s[i] == 0xF0 && s[i + 1] < 0x90)
			return false;
		// Reject codepoints above U+10FFFF
		if (bytes == 4 && s[i] > 0xF4)
			return false;
		if (bytes == 4 && s[i] == 0xF4 && s[i + 1] > 0x8F)
			return false;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt
Line: 136-137

Comment:
**`utf8.cpp` should be in the `if(WIN32)` block**

`utf8.cpp` wraps Windows-only APIs and guards non-Windows builds with `#error "Not implemented"`, exactly like `thread.cpp`. Both files currently sit in the unconditional source list rather than in the `if(WIN32)` block (line 156). CI is Windows-only today so no build breaks now, but moving this into the platform guard makes the intent explicit and avoids a hard compile error if a non-Windows build is ever attempted.

Suggested change: remove `utf8.cpp` and `utf8.h` from the unconditional list and add `utf8.cpp` inside the `if(WIN32)` block alongside `registry.cpp` etc. (`utf8.h` can stay in the common list since it only declares — the linker error would surface the missing implementation clearly).

How can I resolve this? If you propose a fix, please make it concise.

Reviews (8): Last reviewed commit: "feat(utf8): add UTF-8 string conversion ..." | Re-trigger Greptile

@bobtista
Copy link
Copy Markdown
Author

bobtista commented Apr 3, 2026

  • Fixed the if formatting
  • added RFC 3629 overlong and out-of-range checks
  • RE the theoretical memory leak, can that even happen here? set() allocates via the engine's custom memory allocator which crashes on failure rather than throwing, so the leak path can't really be reached right?

DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting"));
break;
}
// TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be beyond the scope of this change. It is not describes in the title. Perhaps is a separate change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok reverted

@xezon xezon added Enhancement Is new feature or request Minor Severity: Minor < Major < Critical < Blocker labels Apr 3, 2026
Copy link
Copy Markdown

@xezon xezon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get_Utf8_Size should not include the null terminator in its size.

delete[] dest;
}
size_t size = Get_Utf8_Size(orig);
std::string ret(size - 1, '\0');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why zero fill?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to use resize()

if (dest_size == 0)
return;
return false;
int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if dest_size does not have enough room for a null terminator?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?

if (!Unicode_To_Utf8(buf, src, srcLen, size))
clear();
else
buf[size] = '\0';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to make ensureUniqueBufferOfSize write the zero terminator always. That would be consistent with std::string::resize.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

validate();
/// @todo srj put in a real translation here; this will only work for 7-bit ascii
// TheSuperHackers @fix bobtista 02/04/2026 Implement UTF-8 conversion replacing 7-bit ASCII only implementation
clear();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is clear really desired here? AsciiString::set does not call clear. It would reuse the buffer if it already had one that was large enough.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the upfront clear(), now we only clear on size == 0

size_t size = Get_Unicode_Size(orig, srcLen);
if (size == 0)
return std::wstring();
std::wstring ret(size, L'\0');
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to fill the string with zeros?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to resize()

return (wchars > 0) ? (size_t)wchars : 0;
}

bool Unicode_To_Utf8(char* dest, const wchar_t* src, size_t srcLen, size_t dest_size)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using different naming styles here: srcLen, dest_size

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with destSize

return result != 0;
}

bool Utf8_To_Unicode(wchar_t* dest, const char* src, size_t srcLen, size_t dest_size)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the naming choice for src len and dest size deliberate?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, using camelCase now

@xezon
Copy link
Copy Markdown

xezon commented Apr 6, 2026

The diff now shows unrelated changes.

@bobtista bobtista force-pushed the bobtista/feat/utf8-string-functions branch from 39d7229 to 40393b8 Compare April 6, 2026 20:02
@bobtista
Copy link
Copy Markdown
Author

bobtista commented Apr 6, 2026

The diff now shows unrelated changes.

Try again cleaned up the commits and force pushed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Enhancement Is new feature or request Minor Severity: Minor < Major < Critical < Blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants