Skip to content

fix: address post-merge CodeRabbit feedback on #684 (suffix-on-conflict)#758

Closed
strawgate wants to merge 13 commits into
masterfrom
feat/suffix-on-conflict-445
Closed

fix: address post-merge CodeRabbit feedback on #684 (suffix-on-conflict)#758
strawgate wants to merge 13 commits into
masterfrom
feat/suffix-on-conflict-445

Conversation

@strawgate

Copy link
Copy Markdown
Owner

Summary

Follow-up to #684 addressing CodeRabbit feedback that landed after merge:

  • docs/COLUMN_NAMING.md: Updated to double-underscore suffixes (__int, __str, __float), documented logfwd.conflict_groups metadata key, bare synthesized column for SQL, int()/float() UDF idiom, and cross-batch type instability warning
  • cross-batch test: Added cross_batch_int_udf_works_on_clean_and_conflict_batches in scanner_datafusion_boundary.rs documenting that int(status) resolves correctly on both clean and conflict batches
  • conflict_schema.rs: Added comment explaining why Utf8 cast intentionally loses zero-copy for StreamingBuilder str columns (SQL transform path only — acceptable trade-off)
  • storage_builder.rs: Added comment explaining _raw pre-reservation condition
  • streaming_builder.rs: Use crate::CONFLICT_GROUPS_METADATA_KEY (already re-exported) instead of crate::storage_builder::CONFLICT_GROUPS_METADATA_KEY
  • scanner.rs: Renamed test field sstatus in test_type_conflict for clarity

Test plan

  • cargo test -p logfwd-arrow -p logfwd-transform — all 23 tests pass including new cross-batch test
  • cargo fmt --check — clean
  • cargo clippy -- -D warnings — clean

🤖 Generated with Claude Code

strawgate and others added 13 commits April 2, 2026 17:26
…#445)

Single-type fields now use bare column names (`status`, `level`) with their
native Arrow type. Suffixed names (`status_int`, `status_str`) only appear
when the same field has multiple types within a single batch.

Changes:
- StreamingBuilder + StorageBuilder: conflict detection in finish_batch()
- Delete rewriter.rs (775 lines of dead SQL text rewriter, never wired in)
- json_extract UDF: suffix_order tries bare name as fallback; non-numeric
  columns return null instead of coercing strings to numbers
- All tests updated to expect bare names for single-type fields
- CI: set RUSTC_WRAPPER="" so cargo test works without sccache

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- json_extract: add regression tests for json_int/json_float returning
  null when the field is a quoted string (bare Utf8 column, no conflict)
- scanner_conformance oracle: panic on type mismatch instead of silently
  skipping assertions; fall back bare→suffixed column lookup so single-
  type fields are actually verified
- scanner_conformance oracle: accept Int64 column when checking a float
  value (e.g. -0 has no decimal/exponent so scanner emits Int64)
- assert_builders_consistent: assert st_col is also string-typed before
  casting to Utf8 for value comparison

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atches (#625)

When the scanner detects a type conflict for a field (e.g. `status` appears
as both int and string across rows), it emits `status_int: Int64` and
`status_str: Utf8View`. SQL using the bare name `status` would fail to resolve.

Add `normalize_conflict_columns()` in the new `conflict_schema` module:
- Detects conflict groups: ≥2 suffixed variants of the same base name with
  no existing bare column
- Adds a computed `status: Utf8` column via COALESCE(int→str, float→str, str)
- Single lone `foo_str` columns (field literally named `foo_str`) are NOT
  treated as conflicts (require ≥2 variants)

Wire into `SqlTransform::execute()` before MemTable registration so every
batch is normalized before DataFusion sees it.

After this change `SELECT status FROM logs` works on both clean batches
(bare `status: Int64` from scanner) and conflict batches (synthesized
`status: Utf8`). Users call `int(status)` / `float(status)` for numeric ops.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, ConflictGroups

- type-suffix-redesign.md: document __ suffix convention, logfwd.conflict_groups
  metadata key, ConflictGroups/TypedValue output abstraction, implementation phases
- column-type-constraints.md: answer all 6 open questions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ct_groups metadata

Rename conflict column suffixes from single-underscore (_int/_str/_float) to
double-underscore (__int/__str/__float) so real field names like `status_int`
cannot collide with synthesized conflict columns.

Stamp `logfwd.conflict_groups` Arrow schema metadata key when builders detect
type conflicts (format: "status:int,str;duration:float,int"). Zero overhead when
no conflicts in the batch.

Updates:
- storage_builder + streaming_builder: emit __int/__float/__str on conflict,
  accumulate conflict_meta, attach CONFLICT_GROUPS_METADATA_KEY to schema
- lib.rs: re-export CONFLICT_GROUPS_METADATA_KEY
- conflict_schema.rs: CONFLICT_SUFFIXES → __ prefix, HashMap → BTreeMap for
  deterministic ordering, preserve schema metadata in normalize_conflict_columns
- json_extract.rs: suffix_order updated for __ prefixes + bare-name fallback
- All tests updated: compliance_data, scanner_conformance, scanner_datafusion_boundary
  (Section 5 conflict tests: 4 new tests for bare-name SQL on conflict batches)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test fixtures

The strip_type_suffix helper was stripping single-underscore suffixes (_str,
_int, _float), which caused two bugs after Phase 10b:

1. Real JSON field names like `start_int` would be incorrectly mapped to
   `start` for scanner pushdown — the exact ambiguity Phase 10b was meant
   to eliminate.
2. `strip_type_suffix("status__int")` returned `"status_"` (strips trailing
   `_int`, leaves a dangling underscore), giving the scanner a garbage key.

Fix: strip __str/__int/__float (double underscore) only. For bare names (the
normal post-Phase-10 case) the function is now a no-op, which is correct.

Also:
- Update stale comments in lib.rs (line 106 and 600) and the Section 5 block
  in scanner_datafusion_boundary.rs to reference __ suffixes consistently.
- Update make_test_batch() and all 14 dependent unit tests to use bare column
  names (level, msg, status, host, region, val, n) instead of the old
  level_str/msg_str/… convention, matching what the Phase-10 scanner emits.
- test_filter_hints_typed_column_stripped now tests severity__int rather than
  severity_int, exercising the actual double-underscore strip path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scanner: assert legacy _str/_int suffixed columns absent in single-type test
- streaming_builder + storage_builder: detect duplicate output column names (reserve_name guard)
- scanner_conformance: panic on missing scanner columns and wrong types instead of silently skipping
- integration: remove stale {field}_{type} and team_str comment references

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit 085c2cf changed BatchMetadata.resource_attrs to Arc<Vec<...>> but
didn't update the test constructors, breaking compilation on Linux/macOS CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same 085c2cf regression — missed two more callsites in
elasticsearch_arrow_ipc.rs and elasticsearch_arrow.rs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
conflict_schema: use logfwd.conflict_groups metadata as authoritative
discriminator instead of suffix heuristics — prevents false-positive
synthesis on user fields literally named foo__int / foo__str

json_extract: Str mode coalesces all conflict variants row-by-row so
json(_raw,'status') returns "200" not null when status is int in that row

scanner_conformance: tighten Int64 fallback tolerance from < 1.0 to exact
equality; fix it uses col.unwrap_or_else (already done); fix loose test

scanner_datafusion_boundary: stamp logfwd.conflict_groups metadata on
test conflict batches to match real builder output

docs: mark AnalyzerRule/TableProvider section as planned (#625), not
current; fix C3 claim in column-type-constraints.md; refresh phase
breakdown in type-suffix-redesign.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explains that C3 is satisfied via the planned TableProvider/AnalyzerRule
approach: referenced columns are advertised as Utf8 with cast rewrites to
typed backing columns, only for columns the query actually uses.
normalize_conflict_columns is a batch-level approximation; #625 replaces it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/COLUMN_NAMING.md: update to double-underscore suffixes (__int,
  __str, __float), document logfwd.conflict_groups metadata key, bare
  synthesized column for SQL, int()/float() UDF idiom, and cross-batch
  type instability warning
- scanner_datafusion_boundary.rs: add cross_batch_int_udf_works test
  documenting that int(status) works on both clean and conflict batches
- conflict_schema.rs: explain why Utf8 cast intentionally loses zero-copy
  for StreamingBuilder str columns (SQL transform path only)
- storage_builder.rs: comment on _raw pre-reservation condition
- streaming_builder.rs: use crate::CONFLICT_GROUPS_METADATA_KEY instead
  of crate::storage_builder::CONFLICT_GROUPS_METADATA_KEY (already
  re-exported at crate root)
- scanner.rs: rename test field s → status in test_type_conflict

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 2, 2026

Copy link
Copy Markdown

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This pull request redesigns type-conflict column naming and schema representation across the logfwd logging pipeline. Single-type JSON fields now emit bare column names (status, level), while multi-type fields use double-underscore-suffixed variants (status__int, status__str). A new CONFLICT_GROUPS_METADATA_KEY constant stores conflict metadata in Arrow schema, and a new normalize_conflict_columns() function synthesizes bare columns for SQL resolution when needed. The previous rewriter.rs module is removed and replaced with metadata-driven normalization. All builders, scanners, transforms, and tests are updated to reflect the new naming scheme. A minor CI change makes the macOS test job explicit about compiler-wrapper behavior.

Possibly related PRs

✨ Finishing Touches
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch feat/suffix-on-conflict-445

Comment @coderabbitai help to get the list of available commands and usage tips.

@strawgate

Copy link
Copy Markdown
Owner Author

Closing in favor of a conflict-free branch. Follow-up PR coming.

@strawgate strawgate closed this Apr 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant