Fix: field names ending in _str/_int/_float are silently renamed/dropped in output JSON #407

@strawgate

Summary

build_col_infos() in logfwd-output/src/lib.rs uses parse_column_name() to strip type suffixes (_str, _int, _float) from Arrow column names before serializing to JSON. parse_column_name uses rfind('_') to find the suffix — but this misidentifies user-defined field names that naturally end with _str, _int, or _float.

Result: A JSON field named status_int that holds a string ("ok") in one row and an integer in another is split into two typed columns that map back to the same logical field. The integer column wins by dedup priority, so the string value is silently dropped from the output.

Reproduction

cargo build --release -p logfwd

# Row 1: field "status_int" has string value "ok", field "count" has int 5
# Row 2: field "status_int" has int value 404 (control)
cat > /tmp/rename_test.log <<'LOGEOF'
{"status_int": "ok", "count": 5}
{"status_int": 404, "count": 3}
LOGEOF

cat > /tmp/rename_test.yaml <<'YAMLEOF'
input:
  type: file
  path: /tmp/rename_test.log
  format: json
output:
  type: stdout
  format: json
YAMLEOF

./target/release/logfwd --config /tmp/rename_test.yaml &
PID=$!; sleep 2; kill $PID 2>/dev/null; wait

Expected Output

{"status_int":"ok","count":5}
{"status_int":404,"count":3}

Actual Output

{"count":5}
{"status_int":404,"count":3}

Row 1: status_int value "ok" is completely missing. No error logged.

Root Cause

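Step 1 — Scanner: across the two rows, status_int holds a string ("ok") and an int (404), so the scanner produces two Arrow columns:

  • status_int_str (the string "ok" for row 1, null for row 2)
  • status_int_int (null for row 1, the int 404 for row 2)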

Step 2 — parse_column_name (lib.rs:86-94):

pub fn parse_column_name(col_name: &str) -> (&str, &str) {
    if let Some(pos) = col_name.rfind('_') {
        let suffix = &col_name[pos + 1..];
        if matches!(suffix, "str" | "int" | "float") {
            return (&col_name[..pos], suffix);
        }
    }
    (col_name, "")
}
  • "status_int_str" → ("status_int", "str") ✓ correct
  • "status_int_int" → ("status_int", "int") ✓ correct
  • Both map to logical field "status_int". ✓ good so far.

Step 3 — build_col_infos dedup (lib.rs:134-140): Keeps int over str (priority 3 > 1). status_int_str column is discarded.

Step 4 — write_row_json (lib.rs:164): Row 1 has status_int_int = NULL, so arr.is_null(row) returns true and the field is silently skipped. Row 1 loses status_int entirely.
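
The four steps can be replayed without Arrow. The sketch below is a simplified model, not the code in lib.rs: only parse_column_name is copied from above, while Cell, type_priority, dedup_columns, render_row and repro_columns are hypothetical stand-ins. The priorities for int (3) and str (1) come from Step 3; the value 2 for float is an assumption.

```rust
/// Copied from the issue: splits a column name at the last '_' when the
/// tail is one of the scanner's type suffixes.
pub fn parse_column_name(col_name: &str) -> (&str, &str) {
    if let Some(pos) = col_name.rfind('_') {
        let suffix = &col_name[pos + 1..];
        if matches!(suffix, "str" | "int" | "float") {
            return (&col_name[..pos], suffix);
        }
    }
    (col_name, "")
}

/// Hypothetical stand-in for a single Arrow cell.
#[derive(Clone, Debug)]
pub enum Cell {
    Null,
    Int(i64),
    Str(String),
}

/// int = 3 and str = 1 are taken from Step 3; float = 2 is an assumption.
fn type_priority(suffix: &str) -> u8 {
    match suffix {
        "int" => 3,
        "float" => 2,
        "str" => 1,
        _ => 0,
    }
}

/// Models the dedup in Step 3: one column index per logical field,
/// highest priority wins, first-seen field order is preserved.
pub fn dedup_columns(names: &[&str]) -> Vec<(String, usize)> {
    let mut kept: Vec<(String, u8, usize)> = Vec::new();
    for (idx, name) in names.iter().enumerate() {
        let (field, suffix) = parse_column_name(name);
        let prio = type_priority(suffix);
        if let Some(pos) = kept.iter().position(|(f, _, _)| f.as_str() == field) {
            if prio > kept[pos].1 {
                kept[pos] = (field.to_string(), prio, idx);
            }
        } else {
            kept.push((field.to_string(), prio, idx));
        }
    }
    kept.into_iter().map(|(f, _, i)| (f, i)).collect()
}

/// Models Step 4: null cells are skipped with no warning.
/// String escaping is omitted to keep the sketch short.
pub fn render_row(columns: &[(&str, Vec<Cell>)], row: usize) -> String {
    let names: Vec<&str> = columns.iter().map(|(n, _)| *n).collect();
    let mut parts = Vec::new();
    for (field, idx) in dedup_columns(&names) {
        match &columns[idx].1[row] {
            Cell::Null => {}
            Cell::Int(v) => parts.push(format!("\"{}\":{}", field, v)),
            Cell::Str(s) => parts.push(format!("\"{}\":\"{}\"", field, s)),
        }
    }
    format!("{{{}}}", parts.join(","))
}

/// The three columns the scanner is described as producing for the repro input.
pub fn repro_columns() -> Vec<(&'static str, Vec<Cell>)> {
    vec![
        ("status_int_str", vec![Cell::Str("ok".to_string()), Cell::Null]),
        ("status_int_int", vec![Cell::Null, Cell::Int(404)]),
        ("count_int", vec![Cell::Int(5), Cell::Int(3)]),
    ]
}
```

Calling render_row(&repro_columns(), 0) yields {"count":5} in this model, matching the Actual Output for row 1: the string column was discarded by dedup and the surviving int column is null in that row.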

More Severe Case: Pure string field named with _int suffix

{"request_timeout_int": "none", "id": 1}

The scanner produces:

  • request_timeout_int_str → logical field "request_timeout_int", suffix "str"

After parse_column_name("request_timeout_int_str"):

  • rfind('_') → position before "str"
  • Returns ("request_timeout_int", "str")

Output JSON key becomes "request_timeout_int" — the user's actual field name. This case works correctly. But if the user has a separate "request_timeout" field with an integer:

  • request_timeout_int → ("request_timeout", "int")
  • request_timeout_int_str → ("request_timeout_int", "str")

Both survive dedup (different logical names). Output keys are "request_timeout" and "request_timeout_int" — the "request_timeout_int" field is silently renamed to itself (accidentally correct), but "request_timeout" is extracted from the wrong column.

Impact

  • Silent data loss and wrong values whenever a user's JSON schema uses field names ending in _str, _int, or _float (common in structured logging: response_code_int, duration_ms_float, level_str, etc.)
  • No error is logged at any stage
  • The output is wrong even when --validate succeeds

Affected Code

  • crates/logfwd-output/src/lib.rs:86-94 (parse_column_name — suffix stripping too aggressive)
  • crates/logfwd-output/src/lib.rs:111-143 (build_col_infos — dedup silently drops columns)

Fix

parse_column_name should only strip the suffix from names that logfwd's own scanner added. One approach: use a distinct separator (e.g., :: or #) that is unlikely to appear in user field names. JSON keys may contain any character, so this narrows the collision rather than eliminating it. Another approach: document that field names ending in _str/_int/_float produce ambiguous output and add a warning.
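
A minimal sketch of the separator approach, assuming the scanner can be changed to emit names like status_int::str. TYPE_SEP and typed_column_name are hypothetical names; the real scanner's naming code is not shown in this issue.

```rust
/// Hypothetical separator between the user's field name and the type tag.
const TYPE_SEP: &str = "::";

/// Assumed scanner side: build the typed column name for a field.
pub fn typed_column_name(field: &str, type_tag: &str) -> String {
    format!("{}{}{}", field, TYPE_SEP, type_tag)
}

/// Output side: strip the tag only when the separator form is present
/// and the remaining field name is non-empty.
pub fn parse_column_name(col_name: &str) -> (&str, &str) {
    if let Some((field, tag)) = col_name.rsplit_once(TYPE_SEP) {
        if !field.is_empty() && matches!(tag, "str" | "int" | "float") {
            return (field, tag);
        }
    }
    // No recognized tag: treat the whole name as the user's field name.
    (col_name, "")
}
```

With this shape, a name such as level_str passes through untouched, and only names carrying the separator are split. A user key that itself ends in ::int would still be misread, so a separator alone does not remove the ambiguity.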

Short-term fix: skip suffix stripping if the result would be an empty prefix, or track which column names were scanner-generated vs. user-defined.
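
Row 1 of the reproduction is lost in Steps 3 and 4 (dedup plus the null skip), which a naming change alone would not touch. One possible complement, not among the proposals above, is to keep every typed column for a logical field and pick the first non-null cell per row. The sketch below assumes columns are tried in the existing priority order; Cell and coalesce_row are hypothetical names.

```rust
/// Hypothetical stand-in for a single Arrow cell.
#[derive(Clone, Debug, PartialEq)]
pub enum Cell {
    Null,
    Int(i64),
    Str(String),
}

/// Returns the first non-null cell for one logical field in the given row.
/// `typed_columns` holds that field's typed columns, ordered from highest
/// to lowest priority (for example int before str).
pub fn coalesce_row<'a>(typed_columns: &[&'a [Cell]], row: usize) -> Option<&'a Cell> {
    typed_columns
        .iter()
        .copied()
        .map(|col| &col[row])
        .find(|cell| !matches!(cell, Cell::Null))
}
```

For the reproduction's status_int columns, this would return the string "ok" for row 1 and the int 404 for row 2, and None only when every typed column is null in that row.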

Metadata

Labels: bug