feat(string): add UTF-8 string conversion and validation functions#2528
bobtista wants to merge 1 commit into TheSuperHackers:main
| Filename | Overview |
|---|---|
| Core/Libraries/Source/WWVegas/WWLib/utf8.h | New header with correct #pragma once guard, clear null-terminator contract in comments, and consistent Snake_Case naming |
| Core/Libraries/Source/WWVegas/WWLib/utf8.cpp | Win32-backed UTF-8 implementation; rejects overlong encodings correctly but surrogate codepoints U+D800–U+DFFF pass the validator |
| Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt | utf8.cpp added to unconditional source list despite being Windows-only; should be guarded by if(WIN32) |
| Core/GameEngine/Source/Common/System/AsciiString.cpp | translate() upgraded from 7-bit ASCII loop to proper UTF-8 via WWLib helpers; ensureUniqueBufferOfSize null-terminator fix is correct |
| Core/GameEngine/Source/Common/System/UnicodeString.cpp | Symmetric UTF-8 upgrade for UnicodeString::translate(); ensureUniqueBufferOfSize fix applied correctly on the wide-char side |
| Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp | MultiByteToWideCharSingleLine and WideCharStringToMultiByte refactored to WWLib helpers; std::wstring manages null termination correctly so the removed manual write was the right call |
Sequence Diagram
sequenceDiagram
participant Caller
participant AsciiString
participant utf8lib as WWLib/utf8
participant Win32
Caller->>AsciiString: translate(UnicodeString)
AsciiString->>utf8lib: Get_Utf8_Size(src, srcLen)
utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, query size)
Win32-->>utf8lib: byte count
utf8lib-->>AsciiString: size
AsciiString->>AsciiString: ensureUniqueBufferOfSize(size+1)<br/>sets buf[size]=0
AsciiString->>utf8lib: Unicode_To_Utf8(buf, src, srcLen, size)
utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, write)
Win32-->>utf8lib: result
utf8lib-->>AsciiString: true / false
AsciiString-->>Caller: UTF-8 string (or clear() on failure)
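The query-then-convert flow in the diagram can be sketched portably. This is only an illustration under stated assumptions: the PR's helpers wrap Win32 `WideCharToMultiByte`, whereas this sketch swaps in a hand-written UTF-16 to UTF-8 encoder over `char16_t` so it compiles anywhere. All `Sketch_*` names are hypothetical and are not part of WWLib.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: decodes one code point from UTF-16 and advances i.
// Returns false on an unpaired surrogate.
bool Sketch_Next_Code_Point(const char16_t* src, size_t srcLen, size_t& i, char32_t& cp)
{
    const char16_t c = src[i++];
    if (c < 0xD800 || c > 0xDFFF) { cp = c; return true; }
    if (c <= 0xDBFF && i < srcLen && src[i] >= 0xDC00 && src[i] <= 0xDFFF)
    {
        cp = 0x10000 + ((char32_t(c) - 0xD800) << 10) + (char32_t(src[i++]) - 0xDC00);
        return true;
    }
    return false;
}

// Number of UTF-8 bytes needed for one code point.
size_t Sketch_Encoded_Length(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Pass 1: size query, excluding any null terminator. Returns 0 on invalid input.
size_t Sketch_Get_Utf8_Size(const char16_t* src, size_t srcLen)
{
    size_t size = 0, i = 0;
    char32_t cp = 0;
    while (i < srcLen)
    {
        if (!Sketch_Next_Code_Point(src, srcLen, i, cp)) return 0;
        size += Sketch_Encoded_Length(cp);
    }
    return size;
}

// Pass 2: writes at most destSize bytes and never writes a null terminator.
bool Sketch_To_Utf8(char* dest, const char16_t* src, size_t srcLen, size_t destSize)
{
    assert(dest != nullptr);
    static const unsigned char lead[5] = {0, 0, 0xC0, 0xE0, 0xF0};
    size_t out = 0, i = 0;
    char32_t cp = 0;
    while (i < srcLen)
    {
        if (!Sketch_Next_Code_Point(src, srcLen, i, cp)) return false;
        const size_t n = Sketch_Encoded_Length(cp);
        if (n > destSize - out) return false; // destination too small
        if (n == 1) { dest[out++] = char(cp); continue; }
        dest[out++] = char(lead[n] | (cp >> (6 * (n - 1))));
        for (size_t k = n - 1; k-- > 0;)
            dest[out++] = char(0x80 | ((cp >> (6 * k)) & 0x3F));
    }
    return true;
}

// Caller side, mirroring the diagram: query, allocate, convert, clear on failure.
std::string Sketch_Translate(const std::u16string& src)
{
    const size_t size = Sketch_Get_Utf8_Size(src.data(), src.size());
    std::string result(size, '\0'); // the string owns the terminator at result[size]
    if (size != 0 && !Sketch_To_Utf8(&result[0], src.data(), src.size(), size))
        result.clear();
    return result;
}
```

Because the owning string keeps its own terminator, the conversion helper only has to fill `size` bytes, which is the contract the diagram describes.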
Prompt To Fix All With AI
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 83-95
Comment:
**Surrogate range not rejected per RFC 3629**
`Utf8_Validate` correctly handles overlong encodings (the comment even cites RFC 3629), but 3-byte sequences encoding surrogates U+D800–U+DFFF (`0xED 0xA0–0xBF 0x80–0xBF`) still pass. RFC 3629 §3 explicitly forbids encoding surrogates in UTF-8. The downstream `MultiByteToWideChar` will reject them at conversion time, so this isn't an immediate security hole, but the validator's stated contract is incomplete. A one-line guard closes the gap:
```suggestion
// Reject overlong encodings per RFC 3629
if (bytes == 2 && s[i] < 0xC2)
    return false;
if (bytes == 3 && s[i] == 0xE0 && s[i + 1] < 0xA0)
    return false;
// Reject surrogates (U+D800-U+DFFF) per RFC 3629
if (bytes == 3 && s[i] == 0xED && s[i + 1] >= 0xA0)
    return false;
if (bytes == 4 && s[i] == 0xF0 && s[i + 1] < 0x90)
    return false;
// Reject codepoints above U+10FFFF
if (bytes == 4 && s[i] > 0xF4)
    return false;
if (bytes == 4 && s[i] == 0xF4 && s[i + 1] > 0x8F)
    return false;
```
How can I resolve this? If you propose a fix, please make it concise.
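For context, a minimal self-contained validator that includes the surrogate guard might look like the sketch below. It is not the PR's `Utf8_Validate`; the `Sketch_*` names and the exact structure are assumptions made for illustration, and only the byte-range checks come from the suggestion above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sequence length implied by a lead byte,
// or 0 if the byte cannot start a valid sequence.
size_t Sketch_Num_Bytes(unsigned char lead)
{
    if (lead < 0x80) return 1;                  // ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 0xC0 and 0xC1 would be overlong
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // leads above 0xF4 exceed U+10FFFF
    return 0;                                   // continuation byte or invalid lead
}

// Hypothetical validator over an explicit byte length.
bool Sketch_Validate(const char* str, size_t length)
{
    assert(str != nullptr || length == 0);
    const unsigned char* s = reinterpret_cast<const unsigned char*>(str);
    size_t i = 0;
    while (i < length)
    {
        const size_t bytes = Sketch_Num_Bytes(s[i]);
        if (bytes == 0 || bytes > length - i)
            return false; // bad lead byte or truncated sequence

        // Every trailing byte must look like 10xxxxxx
        for (size_t k = 1; k < bytes; ++k)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;

        if (bytes == 3 && s[i] == 0xE0 && s[i + 1] < 0xA0)
            return false; // overlong 3-byte form
        if (bytes == 3 && s[i] == 0xED && s[i + 1] >= 0xA0)
            return false; // surrogate U+D800-U+DFFF
        if (bytes == 4 && s[i] == 0xF0 && s[i + 1] < 0x90)
            return false; // overlong 4-byte form
        if (bytes == 4 && s[i] == 0xF4 && s[i + 1] > 0x8F)
            return false; // above U+10FFFF

        i += bytes;
    }
    return true;
}
```

With this shape, `0xED 0x9F 0xBF` (U+D7FF) is accepted while `0xED 0xA0 0x80` (U+D800) is rejected, which is the boundary the review comment is about.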
---
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt
Line: 136-137
Comment:
**`utf8.cpp` should be in the `if(WIN32)` block**
`utf8.cpp` wraps Windows-only APIs and guards non-Windows builds with `#error "Not implemented"`, exactly like `thread.cpp`. Both files currently sit in the unconditional source list rather than in the `if(WIN32)` block (line 156). CI is Windows-only today so no build breaks now, but moving this into the platform guard makes the intent explicit and avoids a hard compile error if a non-Windows build is ever attempted.
Suggested change: remove `utf8.cpp` from the unconditional list and add it inside the `if(WIN32)` block alongside `registry.cpp` etc. `utf8.h` can stay in the common list since it only declares functions; a missing implementation would then surface clearly as a linker error.
How can I resolve this? If you propose a fix, please make it concise.
Reviews (8): Last reviewed commit: "feat(utf8): add UTF-8 string conversion ..."
| DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting")); | ||
| break; | ||
| } | ||
| // TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name |
This appears to be beyond the scope of this change. It is not described in the title. Perhaps it belongs in a separate change?
xezon left a comment
Get_Utf8_Size should not include the null terminator in its size.
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp
| delete[] dest; | ||
| } | ||
| size_t size = Get_Utf8_Size(orig); | ||
| std::string ret(size - 1, '\0'); |
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp
| if (dest_size == 0) | ||
| return; | ||
| return false; | ||
| int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size); |
What happens if dest_size does not have enough room for a null terminator?
The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?
| if (!Unicode_To_Utf8(buf, src, srcLen, size)) | ||
| clear(); | ||
| else | ||
| buf[size] = '\0'; |
I think it would be good to make ensureUniqueBufferOfSize write the zero terminator always. That would be consistent with std::string::resize.
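As a rough illustration of that suggestion, the sketch below uses a hypothetical `SketchBuffer` class (not the engine's `AsciiString`, whose `ensureUniqueBufferOfSize` signature and size convention may differ) whose resize step always writes the terminator, so callers only fill the payload bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a string buffer; not the engine's AsciiString.
class SketchBuffer
{
public:
    SketchBuffer() : m_data(new char[1]), m_size(0) { m_data[0] = '\0'; }
    ~SketchBuffer() { delete[] m_data; }
    SketchBuffer(const SketchBuffer&) = delete;
    SketchBuffer& operator=(const SketchBuffer&) = delete;

    // Makes room for `size` characters plus a terminator, keeps the existing
    // prefix, and always writes the terminator, like std::string::resize.
    void Ensure_Size(size_t size)
    {
        char* fresh = new char[size + 1];
        const size_t keep = (m_size < size) ? m_size : size;
        std::memcpy(fresh, m_data, keep);
        fresh[size] = '\0';
        delete[] m_data;
        m_data = fresh;
        m_size = size;
    }

    char* Data() { return m_data; }
    size_t Size() const { return m_size; }

private:
    char* m_data;
    size_t m_size;
};

// The caller fills exactly `size` bytes and never touches the terminator.
size_t Sketch_Fill_And_Measure(SketchBuffer& buf, const char* bytes, size_t size)
{
    assert(bytes != nullptr);
    buf.Ensure_Size(size);
    std::memcpy(buf.Data(), bytes, size);
    return std::strlen(buf.Data());
}
```

With the terminator owned by the resize step, a conversion that fails partway still leaves a well-formed C string behind, and the `buf[size] = '\0'` line at each call site becomes unnecessary.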
| validate(); | ||
| /// @todo srj put in a real translation here; this will only work for 7-bit ascii | ||
| // TheSuperHackers @fix bobtista 02/04/2026 Implement UTF-8 conversion replacing 7-bit ASCII only implementation | ||
| clear(); |
Is clear really desired here? AsciiString::set does not call clear. It would reuse the buffer if it already had one that was large enough.
Removed the upfront clear(); now we only clear when size == 0.
| size_t size = Get_Unicode_Size(orig, srcLen); | ||
| if (size == 0) | ||
| return std::wstring(); | ||
| std::wstring ret(size, L'\0'); |
Do we need to fill the string with zeros?
| return (wchars > 0) ? (size_t)wchars : 0; | ||
| } | ||
|
|
||
| bool Unicode_To_Utf8(char* dest, const wchar_t* src, size_t srcLen, size_t dest_size) |
Using different naming styles here: srcLen, dest_size
| return result != 0; | ||
| } | ||
|
|
||
| bool Utf8_To_Unicode(wchar_t* dest, const char* src, size_t srcLen, size_t dest_size) |
Is the naming choice for src len and dest size deliberate?
The diff now shows unrelated changes.
Force-pushed: 39d7229 to 40393b8
Try again: cleaned up the commits and force-pushed.
Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.
Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.
New:
WWLib/utf8.h / utf8.cpp
- Utf8_Num_Bytes(char lead): byte count of a UTF-8 character from its lead byte
- Utf8_Trailing_Invalid_Bytes(const char* str, size_t length): count of invalid trailing bytes due to an incomplete multi-byte sequence
- Utf8_Validate(const char* str) / Utf8_Validate(const char* str, size_t length): returns true if the string is valid UTF-8
- Get_Utf8_Size(const wchar_t* src) / Get_Wchar_Size(const char* src): required buffer sizes including null terminator
- Wchar_To_Utf8(char* dest, const wchar_t* src, size_t dest_size)
- Utf8_To_Wchar(wchar_t* dest, const char* src, size_t dest_size)

Naming follows the Snake_Case convention used in WWVegas. Arguments are ordered dest, src, matching the memcpy convention. Implementation wraps Win32 WideCharToMultiByte / MultiByteToWideChar.

AsciiString::translate / UnicodeString::translate
Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

ThreadUtils.cpp
Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions. Also removes the manual dest[len] = 0 null terminator write, which was writing at the wrong position for multi-byte UTF-8 input.
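
To illustrate why a terminator index can be wrong for multi-byte input, here is a small sketch. It assumes the old `len` counted characters rather than output bytes; the removed code is not shown here, so that is a guess, and both `Sketch_*` helpers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Counts code points in valid UTF-8 by skipping continuation bytes (10xxxxxx).
size_t Sketch_Count_Code_Points(const char* utf8, size_t byteLen)
{
    size_t count = 0;
    for (size_t i = 0; i < byteLen; ++i)
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            ++count;
    return count;
}

// Copies the UTF-8 bytes, writes a terminator at terminatorPos, and returns
// how many bytes of the string survive.
size_t Sketch_Length_After_Terminating_At(const char* utf8, size_t byteLen, size_t terminatorPos)
{
    char buf[64];
    assert(byteLen < sizeof(buf) && terminatorPos <= byteLen);
    std::memcpy(buf, utf8, byteLen);
    buf[byteLen] = '\0';       // correct position: one past the last byte
    buf[terminatorPos] = '\0'; // position under test
    return std::strlen(buf);
}
```

For the 5-byte string `"caf\xC3\xA9"` (4 characters), terminating at index 4 cuts the final 2-byte sequence in half and leaves 4 bytes of invalid UTF-8, while terminating at index 5 preserves the whole string.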