feat(string): add UTF-8 string conversion and validation functions by bobtista · Pull Request #2528 · TheSuperHackers/GeneralsGameCode

bobtista · 2026-04-03T00:07:30Z

Relates to refactor(string): Add functions for handling UTF8 encoded strings #2045

Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.

Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: `WWLib/utf8.h` / `utf8.cpp`

Utf8_Num_Bytes(char lead) — byte count of a UTF-8 character from its lead byte
Utf8_Trailing_Invalid_Bytes(const char* str, size_t length) — count of invalid trailing bytes due to an incomplete multi-byte sequence
Utf8_Validate(const char* str) / Utf8_Validate(const char* str, size_t length) — returns true if the string is valid UTF-8
Get_Utf8_Size(const wchar_t* src) / Get_Wchar_Size(const char* src) — required buffer sizes including null terminator
Wchar_To_Utf8(char* dest, const wchar_t* src, size_t dest_size)
Utf8_To_Wchar(wchar_t* dest, const char* src, size_t dest_size)

Naming follows the Snake_Case convention used in WWVegas. Arguments are ordered dest, src matching memcpy convention. Implementation wraps Win32 WideCharToMultiByte / MultiByteToWideChar.

`AsciiString::translate` / `UnicodeString::translate`

Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

`ThreadUtils.cpp`

Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions. Also removes the manual dest[len] = 0 null terminator write, which was writing at the wrong position for multi-byte UTF-8 input.

greptile-apps · 2026-04-03T00:12:52Z

Greptile Summary

This PR replaces the broken 7-bit-only AsciiString::translate() and UnicodeString::translate() implementations with proper UTF-8 conversion via a new WWLib/utf8.h library, and also cleans up raw Win32 API calls in ThreadUtils.cpp. The previous review concern about overlong encodings has been addressed — the validator now explicitly rejects \xC0/\xC1, \xE0\x{<0xA0}, and \xF0\x{<0x90} sequences per RFC 3629. The ensureUniqueBufferOfSize null-terminator fix (moving peek()[usableNumChars] = 0 outside the if (strToCopy) block) is a correct and meaningful improvement.

Confidence Score: 5/5

Safe to merge — all P1 concerns from prior review rounds are addressed; remaining findings are P2 style improvements.

Both open findings are P2: the surrogate gap in the validator is mitigated by downstream Win32 API rejection, and the CMakeLists platform guard is consistent with how thread.cpp is handled and doesn't affect the Windows-only CI. The core correctness of translate(), the ensureUniqueBufferOfSize fix, and the ThreadUtils refactor is sound.

utf8.cpp (surrogate range) and CMakeLists.txt (platform guard) are worth a follow-up but are not blockers.

Important Files Changed

Filename	Overview
Core/Libraries/Source/WWVegas/WWLib/utf8.h	New header with correct #pragma once guard, clear null-terminator contract in comments, and consistent Snake_Case naming
Core/Libraries/Source/WWVegas/WWLib/utf8.cpp	Win32-backed UTF-8 implementation; rejects overlong encodings correctly but surrogate codepoints U+D800–U+DFFF pass the validator
Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt	utf8.cpp added to unconditional source list despite being Windows-only; should be guarded by if(WIN32)
Core/GameEngine/Source/Common/System/AsciiString.cpp	translate() upgraded from 7-bit ASCII loop to proper UTF-8 via WWLib helpers; ensureUniqueBufferOfSize null-terminator fix is correct
Core/GameEngine/Source/Common/System/UnicodeString.cpp	Symmetric UTF-8 upgrade for UnicodeString::translate(); ensureUniqueBufferOfSize fix applied correctly on the wide-char side
Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp	MultiByteToWideCharSingleLine and WideCharStringToMultiByte refactored to WWLib helpers; std::wstring manages null termination correctly so the removed manual write was the right call

Sequence Diagram

sequenceDiagram
    participant Caller
    participant AsciiString
    participant utf8lib as WWLib/utf8
    participant Win32

    Caller->>AsciiString: translate(UnicodeString)
    AsciiString->>utf8lib: Get_Utf8_Size(src, srcLen)
    utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, query size)
    Win32-->>utf8lib: byte count
    utf8lib-->>AsciiString: size
    AsciiString->>AsciiString: ensureUniqueBufferOfSize(size+1)<br/>sets buf[size]=0
    AsciiString->>utf8lib: Unicode_To_Utf8(buf, src, srcLen, size)
    utf8lib->>Win32: WideCharToMultiByte(CP_UTF8, write)
    Win32-->>utf8lib: result
    utf8lib-->>AsciiString: true / false
    AsciiString-->>Caller: UTF-8 string (or clear() on failure)

Prompt To Fix All With AI

This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 83-95

Comment:
**Surrogate range not rejected per RFC 3629**

`Utf8_Validate` correctly handles overlong encodings (the comment even cites RFC 3629), but 3-byte sequences encoding surrogates U+D800–U+DFFF (`0xED 0xA0–0xBF 0x80–0xBF`) still pass. RFC 3629 §3 explicitly forbids encoding surrogates in UTF-8. The downstream `MultiByteToWideChar` will reject them at conversion time, so this isn't an immediate security hole, but the validator's stated contract is incomplete. A one-line guard closes the gap:

```suggestion
		// Reject overlong encodings per RFC 3629
		if (bytes == 2 && s[i] < 0xC2)
			return false;
		if (bytes == 3 && s[i] == 0xE0 && s[i + 1] < 0xA0)
			return false;
		// Reject surrogates (U+D800-U+DFFF) per RFC 3629
		if (bytes == 3 && s[i] == 0xED && s[i + 1] >= 0xA0)
			return false;
		if (bytes == 4 && s[i] == 0xF0 && s[i + 1] < 0x90)
			return false;
		// Reject codepoints above U+10FFFF
		if (bytes == 4 && s[i] > 0xF4)
			return false;
		if (bytes == 4 && s[i] == 0xF4 && s[i + 1] > 0x8F)
			return false;
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt
Line: 136-137

Comment:
**`utf8.cpp` should be in the `if(WIN32)` block**

`utf8.cpp` wraps Windows-only APIs and guards non-Windows builds with `#error "Not implemented"`, exactly like `thread.cpp`. Both files currently sit in the unconditional source list rather than in the `if(WIN32)` block (line 156). CI is Windows-only today so no build breaks now, but moving this into the platform guard makes the intent explicit and avoids a hard compile error if a non-Windows build is ever attempted.

Suggested change: remove `utf8.cpp` and `utf8.h` from the unconditional list and add `utf8.cpp` inside the `if(WIN32)` block alongside `registry.cpp` etc. (`utf8.h` can stay in the common list since it only declares — the linker error would surface the missing implementation clearly).

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (8): Last reviewed commit: "feat(utf8): add UTF-8 string conversion ..." | Re-trigger Greptile}

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

bobtista · 2026-04-03T02:32:50Z

Fixed the if formatting
added RFC 3629 overlong and out-of-range checks
RE the theoretical memory leak, can that even happen here? set() allocates via the engine's custom memory allocator which crashes on failure rather than throwing, so the leak path can't really be reached right?

Core/Libraries/Source/WWVegas/WWLib/utf8.h

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

Core/GameEngine/Source/GameNetwork/GameInfo.cpp

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

Core/GameEngine/Source/Common/System/UnicodeString.cpp

Core/GameEngine/Source/Common/System/AsciiString.cpp

xezon · 2026-04-03T10:02:10Z

Core/GameEngine/Source/GameNetwork/GameInfo.cpp

 								DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting"));
 								break;
 							}
+							// TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name


This appears to be beyond the scope of this change. It is not describes in the title. Perhaps is a separate change?

Ok reverted

Core/GameEngine/Source/Common/System/AsciiString.cpp

Core/GameEngine/Source/GameNetwork/GameInfo.cpp

xezon

Get_Utf8_Size should not include the null terminator in its size.

Core/GameEngine/Source/Common/System/AsciiString.cpp

Core/Libraries/Source/WWVegas/WWLib/utf8.h

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

xezon · 2026-04-04T17:15:07Z

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

-		delete[] dest;
-	}
+	size_t size = Get_Utf8_Size(orig);
+	std::string ret(size - 1, '\0');


Why zero fill?

Changed to use resize()

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

xezon · 2026-04-05T12:17:27Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

 	if (dest_size == 0)
-		return;
+		return false;
 	int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size);


What happens if dest_size does not have enough room for a null terminator?

The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?

Core/GameEngine/Source/Common/System/UnicodeString.cpp

xezon · 2026-04-06T18:45:02Z

Core/GameEngine/Source/Common/System/AsciiString.cpp

+	if (!Unicode_To_Utf8(buf, src, srcLen, size))
+		clear();
+	else
+		buf[size] = '\0';


I think it would be good to make ensureUniqueBufferOfSize write the zero terminator always. That would be consistent with std::string::resize.

Core/GameEngine/Source/Common/System/AsciiString.cpp

xezon · 2026-04-06T18:49:12Z

Core/GameEngine/Source/Common/System/AsciiString.cpp

 	validate();
-	/// @todo srj put in a real translation here; this will only work for 7-bit ascii
+	// TheSuperHackers @fix bobtista 02/04/2026 Implement UTF-8 conversion replacing 7-bit ASCII only implementation
 	clear();


Is clear really desired here? AsciiString::set does not call clear. It would reuse the buffer if it already had one that was large enough.

Removed the upfront clear(), now we only clear on size == 0

xezon · 2026-04-06T18:50:22Z

Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp

+	size_t size = Get_Unicode_Size(orig, srcLen);
+	if (size == 0)
+		return std::wstring();
+	std::wstring ret(size, L'\0');


Do we need to fill the string with zeros?

Changed to resize()

xezon · 2026-04-06T18:51:37Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

+	return (wchars > 0) ? (size_t)wchars : 0;
+}
+
+bool Unicode_To_Utf8(char* dest, const wchar_t* src, size_t srcLen, size_t dest_size)


Using different naming styles here: srcLen, dest_size

went with destSize

xezon · 2026-04-06T18:52:25Z

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

+	return result != 0;
+}
+
+bool Utf8_To_Unicode(wchar_t* dest, const char* src, size_t srcLen, size_t dest_size)


Is the naming choice for src len and dest size deliberate?

no, using camelCase now

xezon · 2026-04-06T19:57:25Z

The diff now shows unrelated changes.

bobtista · 2026-04-06T20:04:14Z

The diff now shows unrelated changes.

Try again cleaned up the commits and force pushed

greptile-apps bot reviewed Apr 3, 2026

View reviewed changes

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp Show resolved Hide resolved

xezon reviewed Apr 3, 2026

View reviewed changes

xezon added Enhancement Is new feature or request Minor Severity: Minor < Major < Critical < Blocker labels Apr 3, 2026

greptile-apps bot reviewed Apr 3, 2026

View reviewed changes

Core/GameEngine/Source/GameNetwork/GameInfo.cpp Outdated Show resolved Hide resolved

xezon reviewed Apr 4, 2026

View reviewed changes

xezon reviewed Apr 5, 2026

View reviewed changes

xezon reviewed Apr 6, 2026

View reviewed changes

feat(utf8): add UTF-8 string conversion and validation functions

40393b8

bobtista force-pushed the bobtista/feat/utf8-string-functions branch from 39d7229 to 40393b8 Compare April 6, 2026 20:02

Conversation

bobtista commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

New: WWLib/utf8.h / utf8.cpp

AsciiString::translate / UnicodeString::translate

ThreadUtils.cpp

Uh oh!

greptile-apps bot commented Apr 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

bobtista commented Apr 3, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

xezon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xezon commented Apr 6, 2026

Uh oh!

bobtista commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

bobtista commented Apr 3, 2026 •

edited

Loading

New: `WWLib/utf8.h` / `utf8.cpp`

`AsciiString::translate` / `UnicodeString::translate`

`ThreadUtils.cpp`

greptile-apps bot commented Apr 3, 2026 •

edited

Loading