Skip to content

fix: NaN panic in metric sort + warn on unsupported input formats#572

Closed
strawgate wants to merge 5 commits into
masterfrom
worktree-type-suffix-redesign
Closed

fix: NaN panic in metric sort + warn on unsupported input formats#572
strawgate wants to merge 5 commits into
masterfrom
worktree-type-suffix-redesign

Conversation

@strawgate
Copy link
Copy Markdown
Owner

Summary

Two bug fixes:

Also closes via audit verification:

Test plan

  • cargo clippy -- -D warnings clean
  • Full workspace tests pass
  • CI passes

Fixes #502, #499

🤖 Generated with Claude Code

strawgate and others added 4 commits April 1, 2026 00:48
The JSON and console output sinks dispatched on the column name suffix
("_int", "_float", "_str") to decide how to serialize values. This was
fragile after SQL transforms (which produce columns without suffixes)
and caused silent data loss when a field had multiple type variants
across rows (the old dedup picked one column and dropped the rest).

Changes:
- ColInfo now groups all typed variants of a field (e.g. status_int +
  status_str) instead of deduplicating to one winner
- write_row_json iterates variants per row, picks first non-null,
  dispatches on Arrow DataType (Int64 → number, Float64 → number,
  anything else → quoted string)
- write_console uses the same DataType dispatch
- Extracted write_json_string and write_json_value helpers for reuse

This matches how OtlpSink already works (DataType dispatch since day 1).

Fixes #430 (SQL computed columns output empty strings)
Fixes #444 (float 1.0 output as integer)
Fixes #428 (boolean values coerced to quoted strings)
Fixes #442 (JSON null values silently dropped from multi-type fields)
Fixes #404 (mixed int/float silently dropped)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add criterion benchmark for the JSON serialization hot path
(build_col_infos + write_row_json on 10K rows). Measures 3.88M rows/sec
on Apple Silicon, ~48% of end-to-end time.

Make ColInfo, build_col_infos, write_row_json pub (was pub(crate)) so
the bench crate can access them. Also removes stale RUSTSEC-2024-0437
advisory ignore from deny.toml.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use itoa/ryu for zero-allocation numeric formatting in write_json_value
  and write_console (both already transitive deps via serde_json/ureq)
- Fix write_console well-known field skip to check ALL variant indices,
  not just primary_idx — prevents duplicate emission when find_col matched
  a different variant than the primary
- Remove dead primary_idx field from ColInfo (no longer referenced)
- Make ColInfo fields pub for bench crate access

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix partial_cmp().unwrap() panic on NaN scores in metric_history.rs
  by falling back to Equal ordering (#502)
- Explicitly match Json/Raw in pipeline input format handling; warn
  and treat as raw for unsupported formats (Logfmt, Syslog) instead
  of silently passing through to JSON scanner (#499)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 1, 2026

Walkthrough

This PR adds a Criterion benchmark measuring JSON serialization, refactors JSON output to group multiple typed column variants per field and to pick the first non-null variant per row, adds RFC‑8259 string escaping and datatype-driven value serialization (including non-finite float -> null), introduces itoa/ryu for numeric formatting, updates stdout sink to select per-row variants and skip well‑known fields by variant indices, changes metric history sorting to use f64::total_cmp, and tightens pipeline input format handling by rejecting unsupported formats earlier.

Possibly related PRs

✨ Finishing Touches
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch worktree-type-suffix-redesign

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-io/src/metric_history.rs`:
- Line 116: The current comparator uses a.t.partial_cmp(&b.t).unwrap_or(Equal)
which breaks sort's total-order requirement; replace it to use the IEEE-754
total ordering via f64::total_cmp (e.g., call a.t.total_cmp(&b.t) or
f64::total_cmp(&a.t, &b.t)) in the comparator passed to all.sort_by so NaN and
other edge cases are handled correctly while keeping the sort stable and
well-defined.

In `@crates/logfwd-output/src/stdout.rs`:
- Around line 207-213: The Float64 branch in the DataType::Float64 match uses
ryu::format_finite which is UB for NaN/Infinity; update this branch (the
DataType::Float64 match arm that calls
arr.as_primitive::<arrow::datatypes::Float64Type>().value(row) and then
self.buf.extend_from_slice(...format_finite...)) to check v.is_finite() first
and only call ryu::Buffer::new().format_finite(v) for finite values; for
non-finite values follow the same fallback used in write_json_value (e.g., write
"null" or another safe representation) instead of calling format_finite.

In `@crates/logfwd/src/pipeline.rs`:
- Around line 502-508: The eprintln! in input_poll_loop causes hot-path
allocations and stderr I/O for every chunk when unsupported formats like
Logfmt/Syslog/Console are configured; update build_input_state to validate the
parsed input format (in the function build_input_state) and return an Err(...)
if the format is Logfmt, Syslog, or Console so these cases fail at construction
time, then remove the catch-all `_` arm in input_poll_loop (the branch currently
extending input.buf and printing the warning) so runtime no longer does
per-chunk logging or allocation for unsupported formats.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f5aef861-2138-4ff0-aa24-d60ddaa3a491

📥 Commits

Reviewing files that changed from the base of the PR and between 0321b18 and 0c37b83.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • crates/logfwd-bench/benches/pipeline.rs
  • crates/logfwd-io/src/metric_history.rs
  • crates/logfwd-output/Cargo.toml
  • crates/logfwd-output/src/lib.rs
  • crates/logfwd-output/src/stdout.rs
  • crates/logfwd/src/pipeline.rs

Comment thread crates/logfwd-io/src/metric_history.rs Outdated
Comment thread crates/logfwd-output/src/stdout.rs
Comment thread crates/logfwd/src/pipeline.rs
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two correctness issues should be fixed before merge:

  1. crates/logfwd-io/src/metric_history.rs:116 — using a.t.partial_cmp(&b.t).unwrap_or(Ordering::Equal) creates a non-total comparator when NaN is present, so sort_by can leave finite timestamps out of order. This breaks the intended chronological merge before dedup_by. Use a total order (for example f64::total_cmp) and explicitly decide NaN placement.

  2. crates/logfwd-output/src/stdout.rs:212ryu::Buffer::format_finite(v) is called without v.is_finite() check for Float64. For NaN/±Inf, ryu produces an unspecified numeric string, so console output can silently emit incorrect values. Guard non-finite values and emit a stable representation (null/NaN/inf) consistent with your output policy.

Optional follow-up: crates/logfwd/src/pipeline.rs:503 currently warns on every chunk for unsupported formats, which can flood stderr under load; consider warning once per input source or rate-limiting.

Warning

⚠️ Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • docs.rs

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "docs.rs"

See Network Configuration for more information.

Note

🔒 Integrity filtering filtered 2 items

Integrity filtering activated and filtered the following items during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.

  • pr:#572 (pull_request_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)
  • issue:#499 (issue_read: has lower integrity than agent requires. The agent cannot read data with integrity below "approved".)

What is this? | From workflow: AI: PR Review

Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not.

- use f64::total_cmp for metric history sorting with NaN
- guard stdout Float64 formatting for non-finite values
- reject unsupported input formats at pipeline build time

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 1, 2026

Addressed all 3 unresolved review threads and pushed commit ed14181.

  • Resolved PRRT_kwDORzg8fM54JHFc: replaced partial_cmp().unwrap_or(...) with f64::total_cmp in crates/logfwd-io/src/metric_history.rs.
  • Resolved PRRT_kwDORzg8fM54JHFn: guarded ryu::format_finite with is_finite() fallback to null in crates/logfwd-output/src/stdout.rs.
  • Resolved PRRT_kwDORzg8fM54JHFq: moved unsupported input-format handling to construction-time validation in build_input_state, removed hot-path warning behavior, and kept a non-exhaustive wildcard as unreachable! in crates/logfwd/src/pipeline.rs.
Validation
  • RUSTC_WRAPPER= cargo clippy -p logfwd-io -p logfwd-output -p logfwd --all-targets -- -D warnings ✅ (for touched crates after changes)
  • RUSTC_WRAPPER= cargo test -p logfwd-io -p logfwd-output -p logfwd ⚠️ did not complete in this runner (stalled for extended time while compiling logfwd test artifacts); no test failures were emitted before timeout.
  • Full-workspace clippy and test were also attempted, but full clippy is currently blocked by a pre-existing bench lint in crates/logfwd-core/benches/scanner.rs (explicit_iter_loop) unrelated to this PR.

Note

🔒 Integrity filtering filtered 1 item

Integrity filtering activated and filtered the following item during workflow execution.
This happens when a tool call accesses a resource that does not meet the required integrity or secrecy level of the workflow.


What is this? | From workflow: AI: Address PR Review Feedback

Give us feedback! React with 🚀 if perfect, 👍 if helpful, 👎 if not.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-output/src/stdout.rs`:
- Around line 166-172: The console output skips entire grouped columns when any
variant matches ts_idx/level_idx/msg_idx, causing mixed-type fields (e.g.,
msg_str vs msg_int) to be dropped; update the logic in stdout.rs around the
col.field_name == "_raw" guard to resolve well-known fields by scanning cols for
the first non-null variant the same way as in crates/logfwd-output/src/lib.rs
(lines ~226-255) — pick the first valid variant index for timestamp, level and
message from the group and use those resolved indices for rendering, then skip
by the group name (column.field_name) instead of by individual variant; apply
the same fix to the similar block at the other occurrence (around lines
177-185).

In `@crates/logfwd/src/pipeline.rs`:
- Around line 499-507: Add a regression test that asserts the
constructor/runtime format invariant by verifying build_input_state rejects
Logfmt, Syslog, and Console formats so input_poll_loop cannot hit the
unreachable! branch; specifically, write a small config-level test that calls
build_input_state (and any constructor wrapper used at runtime) with
input.format set to Format::Logfmt, Format::Syslog, and Format::Console and
expects an Err for each case, and mirror the same assertions for the second site
referenced around the other unreachable! (the block around the lines 628-633) to
ensure both code paths fail fast rather than panicking in input_poll_loop.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6b789098-fc7b-4a2f-bca0-50037dd5ed6c

📥 Commits

Reviewing files that changed from the base of the PR and between 0c37b83 and 5a39712.

📒 Files selected for processing (3)
  • crates/logfwd-io/src/metric_history.rs
  • crates/logfwd-output/src/stdout.rs
  • crates/logfwd/src/pipeline.rs

Comment on lines +166 to +172
// Skip the well-known columns and _raw. Check ALL variant
// indices — find_col may have matched a different variant
// (e.g. message_str vs message_int).
if col.field_name == "_raw"
|| col.variants.iter().any(|(idx, _)| {
Some(*idx) == ts_idx || Some(*idx) == level_idx || Some(*idx) == msg_idx
})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Console output can drop mixed-type well-known fields.

This now skips the whole grouped field when any variant matches ts_idx / level_idx / msg_idx, but the timestamp/level/message rendering above still reads only one fixed column index. In a batch with both msg_str and msg_int, rows that populate only msg_int will print neither the main message nor msg=.... Resolve the well-known fields from cols with the same first-non-null-variant logic used in crates/logfwd-output/src/lib.rs:226-255, then skip by group.

Also applies to: 177-185

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-output/src/stdout.rs` around lines 166 - 172, The console
output skips entire grouped columns when any variant matches
ts_idx/level_idx/msg_idx, causing mixed-type fields (e.g., msg_str vs msg_int)
to be dropped; update the logic in stdout.rs around the col.field_name == "_raw"
guard to resolve well-known fields by scanning cols for the first non-null
variant the same way as in crates/logfwd-output/src/lib.rs (lines ~226-255) —
pick the first valid variant index for timestamp, level and message from the
group and use those resolved indices for rendering, then skip by the group name
(column.field_name) instead of by individual variant; apply the same fix to the
similar block at the other occurrence (around lines 177-185).

Comment on lines +499 to +507
Format::Json | Format::Raw => {
input.buf.extend_from_slice(&chunk);
}
_ => {
unreachable!(
"unsupported format should be rejected in build_input_state: {:?}",
input.format
);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add regression coverage for the constructor/runtime format invariant.

input_poll_loop now depends on build_input_state rejecting Logfmt / Syslog / Console; if that guard regresses later, the failure mode becomes an input-thread panic via unreachable!. Please lock this down with a small config-level test that these input formats return Err.

Also applies to: 628-633

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/src/pipeline.rs` around lines 499 - 507, Add a regression test
that asserts the constructor/runtime format invariant by verifying
build_input_state rejects Logfmt, Syslog, and Console formats so input_poll_loop
cannot hit the unreachable! branch; specifically, write a small config-level
test that calls build_input_state (and any constructor wrapper used at runtime)
with input.format set to Format::Logfmt, Format::Syslog, and Format::Console and
expects an Err for each case, and mirror the same assertions for the second site
referenced around the other unreachable! (the block around the lines 628-633) to
ensure both code paths fail fast rather than panicking in input_poll_loop.

@strawgate
Copy link
Copy Markdown
Owner Author

Closing — the format warning fix is superseded by #581 (input layer refactor). The NaN panic fix (#502) will be addressed separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

partial_cmp().unwrap() in ChunkTuner can panic on NaN scores

1 participant