Summary
build_col_infos() in logfwd-output/src/lib.rs uses parse_column_name() to strip type suffixes (_str, _int, _float) from Arrow column names before serializing to JSON. parse_column_name uses rfind('_') to find the suffix — but this misidentifies user-defined field names that naturally end with _str, _int, or _float.
Result: A JSON field named status_int with a string value ("ok") is silently dropped from its output row when another row carries an integer under the same key. The integer column wins by dedup priority and the string value is never written, so the output is wrong and no error is reported.
Reproduction
cargo build --release -p logfwd
# Row 1: field "status_int" has string value "ok", field "count" has int 5
# Row 2: field "status_int" has int value 404 (control)
cat > /tmp/rename_test.log <<'LOGEOF'
{"status_int": "ok", "count": 5}
{"status_int": 404, "count": 3}
LOGEOF
cat > /tmp/rename_test.yaml <<'YAMLEOF'
input:
  type: file
  path: /tmp/rename_test.log
  format: json
output:
  type: stdout
  format: json
YAMLEOF
./target/release/logfwd --config /tmp/rename_test.yaml &
PID=$!; sleep 2; kill $PID 2>/dev/null; wait
Expected Output
{"status_int":"ok","count":5}
{"status_int":404,"count":3}
Actual Output
{"count":5}
{"status_int":404,"count":3}
Row 1: status_int value "ok" is completely missing. No error logged.
Root Cause
Step 1 — Scanner: {"status_int": "ok"} produces two Arrow columns:
status_int_str (the string "ok")
status_int_int (the int 404, null for row 1)
Step 2 — parse_column_name (lib.rs:86-94):
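pub fn parse_column_name(col_name: &str) -> (&str, &str) {
    // Split at the LAST underscore and check whether the tail is a type suffix.
    if let Some(pos) = col_name.rfind('_') {
        let suffix = &col_name[pos + 1..];
        if matches!(suffix, "str" | "int" | "float") {
            return (&col_name[..pos], suffix);
        }
    }
    // No recognized suffix: return the name unchanged with an empty type.
    (col_name, "")
}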
"status_int_str" → ("status_int", "str") ✓ correct
"status_int_int" → ("status_int", "int") ✓ correct
Both map to the logical field "status_int". ✓ good so far.
Step 3 — build_col_infos dedup (lib.rs:134-140): Keeps int over str (priority 3 > 1). status_int_str column is discarded.
Step 4 — write_row_json (lib.rs:164): Row 1 has status_int_int = NULL. arr.is_null(row) → true → field silently skipped. Row 1 loses status_int entirely.
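The four steps above can be sketched as a small standalone simulation. This is not the code from lib.rs: dedup_columns, render_row, and the priority table are hypothetical stand-ins built only from what the steps describe (int priority 3 beats str priority 1, null cells are skipped), and the float priority of 2 is an assumption.

```rust
use std::collections::HashMap;

/// Same suffix-splitting logic as the parse_column_name quoted in Step 2.
pub fn parse_column_name(col_name: &str) -> (&str, &str) {
    if let Some(pos) = col_name.rfind('_') {
        let suffix = &col_name[pos + 1..];
        if matches!(suffix, "str" | "int" | "float") {
            return (&col_name[..pos], suffix);
        }
    }
    (col_name, "")
}

/// Hypothetical priority table. The issue only states int (3) > str (1);
/// the value for float is an assumption.
fn priority(suffix: &str) -> u8 {
    match suffix {
        "int" => 3,
        "float" => 2,
        "str" => 1,
        _ => 0,
    }
}

/// Step 3 (sketch): per logical field, keep only the highest-priority column.
/// Returns a map from logical field name to the winning column name.
pub fn dedup_columns<'a>(columns: &[&'a str]) -> HashMap<&'a str, &'a str> {
    let mut winners: HashMap<&'a str, &'a str> = HashMap::new();
    for &col in columns {
        let (field, suffix) = parse_column_name(col);
        let replace = match winners.get(field) {
            Some(current) => priority(suffix) > priority(parse_column_name(current).1),
            None => true,
        };
        if replace {
            winners.insert(field, col);
        }
    }
    winners
}

/// Step 4 (sketch): write one row, skipping null cells of the winning columns.
/// `row` maps column name to an already-serialized JSON value, or None for null.
pub fn render_row(columns: &[&str], row: &HashMap<&str, Option<&str>>) -> String {
    let winners = dedup_columns(columns);
    let mut parts = Vec::new();
    for &col in columns {
        let (field, _) = parse_column_name(col);
        if winners.get(field) != Some(&col) {
            continue; // column lost the dedup, its values are never consulted
        }
        if let Some(Some(value)) = row.get(col) {
            parts.push(format!("\"{}\":{}", field, value));
        }
    }
    format!("{{{}}}", parts.join(","))
}
```

Feeding row 1 of the reproduction into render_row (status_int_str = "ok", status_int_int = null, count_int = 5) yields {"count":5}: the string column lost the dedup and the winning int column is null for that row, which matches the Actual Output section.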
More Severe Case: Pure string field named with _int suffix
{"request_timeout_int": "none", "id": 1}
The scanner produces:
request_timeout_int_str → logical field "request_timeout_int", suffix "str"
After parse_column_name("request_timeout_int_str"):
- rfind('_') → position before "str"
- Returns ("request_timeout_int", "str")
Output JSON key becomes "request_timeout_int" — the user's actual field name. This case works correctly. But if the user has a separate "request_timeout" field with an integer:
request_timeout_int → ("request_timeout", "int")
request_timeout_int_str → ("request_timeout_int", "str")
Both survive dedup because their logical names differ. The output keys are "request_timeout" and "request_timeout_int". The "request_timeout_int" field comes out under its original name (accidentally correct), but "request_timeout" is extracted from the wrong column.
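To make the column-name mapping in this case easy to check, here is a minimal sketch that reuses the parse_column_name logic quoted in Step 2. The helper scanner_column_name is hypothetical: it assumes the scanner names columns as field, underscore, type, which is what the examples in this issue suggest but which is not confirmed against the scanner source.

```rust
/// Copy of the parse_column_name logic quoted in Step 2.
pub fn parse_column_name(col_name: &str) -> (&str, &str) {
    if let Some(pos) = col_name.rfind('_') {
        let suffix = &col_name[pos + 1..];
        if matches!(suffix, "str" | "int" | "float") {
            return (&col_name[..pos], suffix);
        }
    }
    (col_name, "")
}

/// Hypothetical stand-in for the scanner's naming scheme: `<field>_<type>`.
pub fn scanner_column_name(field: &str, ty: &str) -> String {
    format!("{}_{}", field, ty)
}

/// True if parsing the scanner-style column name recovers the original
/// (field, type) pair.
pub fn round_trips(field: &str, ty: &str) -> bool {
    let col = scanner_column_name(field, ty);
    parse_column_name(&col) == (field, ty)
}
```

Under that naming assumption, round_trips("request_timeout_int", "str") and round_trips("request_timeout", "int") both hold. The ambiguity is in the other direction: parse_column_name("request_timeout_int") returns ("request_timeout", "int") no matter whether that column name came from the scanner or is a field name that simply ends in _int, because the function has no way to tell the two apart.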
Impact
- Silent data loss and wrong values whenever a user's JSON schema uses field names ending in _str, _int, or _float (common in structured logging: response_code_int, duration_ms_float, level_str, etc.)
- No error is logged at any stage
- The output is wrong even when --validate succeeds
Affected Code
crates/logfwd-output/src/lib.rs:86-94 (parse_column_name — suffix stripping too aggressive)
crates/logfwd-output/src/lib.rs:111-143 (build_col_infos — dedup silently drops columns)
Fix
parse_column_name should only strip the suffix from names that logfwd's own scanner added. One approach: use a distinct separator (e.g., :: or #) that rarely appears in JSON keys (JSON allows any string as a key, so no separator is strictly collision-free). Another approach: document that field names ending in _str/_int/_float produce ambiguous output and add a warning.
Short-term fix: skip suffix stripping if the result would be an empty prefix, or track which column names were scanner-generated vs. user-defined.
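A minimal sketch of the separator approach combined with the empty-prefix guard, for discussion only. The name parse_column_name_sep and the constant TYPE_SEP are hypothetical, the choice of "::" is just one of the candidates mentioned above, and the scanner would have to emit the same separator for this to work.

```rust
/// Hypothetical separator between field name and type suffix.
/// The issue suggests `::` or `#` as candidates.
const TYPE_SEP: &str = "::";

/// Split a column name into (field, type) at the LAST occurrence of TYPE_SEP.
/// Names without the separator, with an unknown suffix, or with an empty
/// prefix are returned unchanged with an empty type.
pub fn parse_column_name_sep(col_name: &str) -> (&str, &str) {
    if let Some((field, suffix)) = col_name.rsplit_once(TYPE_SEP) {
        if !field.is_empty() && matches!(suffix, "str" | "int" | "float") {
            return (field, suffix);
        }
    }
    (col_name, "")
}
```

With this scheme a column named status_int::str parses to ("status_int", "str"), while a plain status_int is left untouched because it contains no separator. Splitting at the last separator means a field name that itself contains "::" still parses back correctly as long as the scanner always appends its own suffix.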