fix: address post-merge CodeRabbit feedback on #684 (suffix-on-conflict)#758
fix: address post-merge CodeRabbit feedback on #684 (suffix-on-conflict)#758strawgate wants to merge 13 commits into
Conversation
…#445) Single-type fields now use bare column names (`status`, `level`) with their native Arrow type. Suffixed names (`status_int`, `status_str`) only appear when the same field has multiple types within a single batch. Changes: - StreamingBuilder + StorageBuilder: conflict detection in finish_batch() - Delete rewriter.rs (775 lines of dead SQL text rewriter, never wired in) - json_extract UDF: suffix_order tries bare name as fallback; non-numeric columns return null instead of coercing strings to numbers - All tests updated to expect bare names for single-type fields - CI: set RUSTC_WRAPPER="" so cargo test works without sccache Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- json_extract: add regression tests for json_int/json_float returning null when the field is a quoted string (bare Utf8 column, no conflict) - scanner_conformance oracle: panic on type mismatch instead of silently skipping assertions; fall back bare→suffixed column lookup so single- type fields are actually verified - scanner_conformance oracle: accept Int64 column when checking a float value (e.g. -0 has no decimal/exponent so scanner emits Int64) - assert_builders_consistent: assert st_col is also string-typed before casting to Utf8 for value comparison Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atches (#625) When the scanner detects a type conflict for a field (e.g. `status` appears as both int and string across rows), it emits `status_int: Int64` and `status_str: Utf8View`. SQL using the bare name `status` would fail to resolve. Add `normalize_conflict_columns()` in the new `conflict_schema` module: - Detects conflict groups: ≥2 suffixed variants of the same base name with no existing bare column - Adds a computed `status: Utf8` column via COALESCE(int→str, float→str, str) - Single lone `foo_str` columns (field literally named `foo_str`) are NOT treated as conflicts (require ≥2 variants) Wire into `SqlTransform::execute()` before MemTable registration so every batch is normalized before DataFusion sees it. After this change `SELECT status FROM logs` works on both clean batches (bare `status: Int64` from scanner) and conflict batches (synthesized `status: Utf8`). Users call `int(status)` / `float(status)` for numeric ops. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, ConflictGroups - type-suffix-redesign.md: document __ suffix convention, logfwd.conflict_groups metadata key, ConflictGroups/TypedValue output abstraction, implementation phases - column-type-constraints.md: answer all 6 open questions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ct_groups metadata Rename conflict column suffixes from single-underscore (_int/_str/_float) to double-underscore (__int/__str/__float) so real field names like `status_int` cannot collide with synthesized conflict columns. Stamp `logfwd.conflict_groups` Arrow schema metadata key when builders detect type conflicts (format: "status:int,str;duration:float,int"). Zero overhead when no conflicts in the batch. Updates: - storage_builder + streaming_builder: emit __int/__float/__str on conflict, accumulate conflict_meta, attach CONFLICT_GROUPS_METADATA_KEY to schema - lib.rs: re-export CONFLICT_GROUPS_METADATA_KEY - conflict_schema.rs: CONFLICT_SUFFIXES → __ prefix, HashMap → BTreeMap for deterministic ordering, preserve schema metadata in normalize_conflict_columns - json_extract.rs: suffix_order updated for __ prefixes + bare-name fallback - All tests updated: compliance_data, scanner_conformance, scanner_datafusion_boundary (Section 5 conflict tests: 4 new tests for bare-name SQL on conflict batches) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test fixtures
The strip_type_suffix helper was stripping single-underscore suffixes (_str,
_int, _float), which caused two bugs after Phase 10b:
1. Real JSON field names like `start_int` would be incorrectly mapped to
`start` for scanner pushdown — the exact ambiguity Phase 10b was meant
to eliminate.
2. `strip_type_suffix("status__int")` returned `"status_"` (strips trailing
`_int`, leaves a dangling underscore), giving the scanner a garbage key.
Fix: strip __str/__int/__float (double underscore) only. For bare names (the
normal post-Phase-10 case) the function is now a no-op, which is correct.
Also:
- Update stale comments in lib.rs (line 106 and 600) and the Section 5 block
in scanner_datafusion_boundary.rs to reference __ suffixes consistently.
- Update make_test_batch() and all 14 dependent unit tests to use bare column
names (level, msg, status, host, region, val, n) instead of the old
level_str/msg_str/… convention, matching what the Phase-10 scanner emits.
- test_filter_hints_typed_column_stripped now tests severity__int rather than
severity_int, exercising the actual double-underscore strip path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scanner: assert legacy _str/_int suffixed columns absent in single-type test
- streaming_builder + storage_builder: detect duplicate output column names (reserve_name guard)
- scanner_conformance: panic on missing scanner columns and wrong types instead of silently skipping
- integration: remove stale {field}_{type} and team_str comment references
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Commit 085c2cf changed BatchMetadata.resource_attrs to Arc<Vec<...>> but didn't update the test constructors, breaking compilation on Linux/macOS CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same 085c2cf regression — missed two more callsites in elasticsearch_arrow_ipc.rs and elasticsearch_arrow.rs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
conflict_schema: use logfwd.conflict_groups metadata as authoritative discriminator instead of suffix heuristics — prevents false-positive synthesis on user fields literally named foo__int / foo__str json_extract: Str mode coalesces all conflict variants row-by-row so json(_raw,'status') returns "200" not null when status is int in that row scanner_conformance: tighten Int64 fallback tolerance from < 1.0 to exact equality; fix it uses col.unwrap_or_else (already done); fix loose test scanner_datafusion_boundary: stamp logfwd.conflict_groups metadata on test conflict batches to match real builder output docs: mark AnalyzerRule/TableProvider section as planned (#625), not current; fix C3 claim in column-type-constraints.md; refresh phase breakdown in type-suffix-redesign.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explains that C3 is satisfied via the planned TableProvider/AnalyzerRule approach: referenced columns are advertised as Utf8 with cast rewrites to typed backing columns, only for columns the query actually uses. normalize_conflict_columns is a batch-level approximation; #625 replaces it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/COLUMN_NAMING.md: update to double-underscore suffixes (__int, __str, __float), document logfwd.conflict_groups metadata key, bare synthesized column for SQL, int()/float() UDF idiom, and cross-batch type instability warning - scanner_datafusion_boundary.rs: add cross_batch_int_udf_works test documenting that int(status) works on both clean and conflict batches - conflict_schema.rs: explain why Utf8 cast intentionally loses zero-copy for StreamingBuilder str columns (SQL transform path only) - storage_builder.rs: comment on _raw pre-reservation condition - streaming_builder.rs: use crate::CONFLICT_GROUPS_METADATA_KEY instead of crate::storage_builder::CONFLICT_GROUPS_METADATA_KEY (already re-exported at crate root) - scanner.rs: rename test field s → status in test_type_conflict Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Caution Review failedPull request was closed or merged during review WalkthroughThis pull request redesigns type-conflict column naming and schema representation across the logfwd logging pipeline. Single-type JSON fields now emit bare column names ( Possibly related PRs
✨ Finishing Touches⚔️ Resolve merge conflicts
Comment |
|
Closing in favor of a conflict-free branch. Follow-up PR coming. |
Summary
Follow-up to #684 addressing CodeRabbit feedback that landed after merge:
__int,__str,__float), documentedlogfwd.conflict_groupsmetadata key, bare synthesized column for SQL,int()/float()UDF idiom, and cross-batch type instability warningcross_batch_int_udf_works_on_clean_and_conflict_batchesinscanner_datafusion_boundary.rsdocumenting thatint(status)resolves correctly on both clean and conflict batches_rawpre-reservation conditioncrate::CONFLICT_GROUPS_METADATA_KEY(already re-exported) instead ofcrate::storage_builder::CONFLICT_GROUPS_METADATA_KEYs→statusintest_type_conflictfor clarityTest plan
cargo test -p logfwd-arrow -p logfwd-transform— all 23 tests pass including new cross-batch testcargo fmt --check— cleancargo clippy -- -D warnings— clean🤖 Generated with Claude Code