Skip to content

feat: suffix column names only on type conflict, delete dead rewriter (#445)#684

Merged
strawgate merged 12 commits into
masterfrom
feat/suffix-on-conflict-445
Apr 2, 2026
Merged

feat: suffix column names only on type conflict, delete dead rewriter (#445)#684
strawgate merged 12 commits into
masterfrom
feat/suffix-on-conflict-445

Conversation

@strawgate
Copy link
Copy Markdown
Owner

Summary

Implements the "suffix only on conflict" column naming design from #445.

The output layer already dispatched on Arrow DataType rather than column name suffix (commit 0321b18), so serialization is unaffected.

Changes

File Change
streaming_builder.rs, storage_builder.rs Conflict detection in finish_batch() — bare name when single-type, suffixed on conflict
rewriter.rs Deleted (775 lines, dead code)
json_extract.rs suffix_order tries bare name as fallback; non-numeric columns return null instead of coercing "200"200
All test files Bare names for single-type fields; suffixed names preserved where genuine conflicts exist
ci.yml RUSTC_WRAPPER: "" so cargo test works in CI without sccache
type-suffix-redesign.md Updated VIEW reference → AnalyzerRule approach (see #625)

Test plan

  • All existing tests pass with bare names for single-type fields
  • Type-conflict tests still produce _int/_str/_float suffixes
  • json_extract UDF: json_int on a quoted string returns null (not the coerced int value)
  • CI passes

Closes

Partially closes #445 (steps 1-3 complete; schema padding and AnalyzerRule tracked in #625).

🤖 Generated with Claude Code

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Warning

.coderabbit.yaml has a parsing error

The CodeRabbit configuration file in this repository has a parsing error and default settings were used instead. Please fix the error(s) in the configuration file. You can initialize chat with CodeRabbit to get help with the configuration file.

💥 Parsing errors (1)
Validation error: Instructions must be under 10000 characters at "reviews.pre_merge_checks.custom_checks[1].instructions"
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Walkthrough

Builders now emit bare field names for single-type JSON fields and emit double-underscore type-suffixed variant columns (e.g., status__int, status__str) only for per-batch type conflicts. Builders write conflict-group metadata under logfwd.conflict_groups. The SQL rewriter was removed and replaced by a conflict-schema normalization step that synthesizes bare Utf8 columns from suffixed variants for SQL resolution. JSON-extraction UDFs prefer typed suffixes then fall back to bare names; numeric UDFs return NULL when the resolved scanner column is not numeric. CI macOS test clears RUSTC_WRAPPER for cargo test.

Possibly related PRs

🚥 Pre-merge checks | ✅ 1 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Out of Scope Changes check ❓ Inconclusive Most changes align with #445 objectives, but CI modification (RUSTC_WRAPPER environment variable) and documentation updates to unimplemented features (normalize_conflict_columns, ConflictGroups output abstraction) may exceed scope. Clarify whether CI environment changes are required for this PR or belong in a separate change, and confirm if documentation for Phase 10b/10c unimplemented features should land before or after implementation.
✅ Passed checks (1 passed)
Check name Status Explanation
Linked Issues check ✅ Passed PR implements all core requirements from #445: bare column names for single-type fields, suffixed columns (__int/__str/__float) only on conflicts, removes dead rewriter code (775 lines), and updates all downstream consumers.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
crates/logfwd-arrow/src/scanner.rs (1)

193-213: ⚠️ Potential issue | 🟡 Minor

Assert the legacy suffixed names are absent in the single-type path.

These assertions only prove that host and status exist. If the scanner accidentally emitted both host and host_str / status_int, this test would still pass and miss the main invariant of the redesign.

💡 Tighten the regression
         assert_eq!(
             batch
                 .column_by_name("status")
                 .unwrap()
                 .as_any()
                 .downcast_ref::<Int64Array>()
                 .unwrap()
                 .value(1),
             404
         );
+        assert!(
+            batch.column_by_name("host_str").is_none(),
+            "single-type string fields should not emit legacy suffixed columns"
+        );
+        assert!(
+            batch.column_by_name("status_int").is_none(),
+            "single-type int fields should not emit legacy suffixed columns"
+        );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-arrow/src/scanner.rs` around lines 193 - 213, The test
currently only checks that "host" and "status" columns exist but not that legacy
suffixed names were omitted; update the single-type fields assertions to also
assert absence of the legacy suffixed names by checking
batch.column_by_name("host_str").is_none() and
batch.column_by_name("status_int").is_none() (or equivalent asserts), so the
test fails if the scanner emits both the new names and the old suffixed
variants; apply this change in the single-type fields block around the existing
asserts that reference batch and column_by_name.
crates/logfwd-arrow/src/streaming_builder.rs (1)

271-324: ⚠️ Potential issue | 🔴 Critical

Prevent emitted-name collisions before building the schema.

This naming rule can now generate the same output name for different fields: e.g. a mixed-type status emits status_int, while a separate single-type field literally named status_int also emits bare status_int. The same problem exists for a user field named _raw when keep_raw later adds the reserved _raw column. That yields an ambiguous batch where one array is effectively hidden by name.

💡 Fail fast on duplicate output names
+        let mut emitted_names = std::collections::HashSet::new();
+        if self.keep_raw && !self.raw_views.is_empty() {
+            emitted_names.insert("_raw".to_string());
+        }
+
+        let mut reserve_name = |name: &str| -> Result<(), ArrowError> {
+            if emitted_names.insert(name.to_string()) {
+                Ok(())
+            } else {
+                Err(ArrowError::InvalidArgumentError(format!(
+                    "duplicate output column name: {name}"
+                )))
+            }
+        };
+
         for fc in &self.fields {
             // Field names come from JSON keys (valid UTF-8 in well-formed input).
             // Use from_utf8_lossy so that fuzz inputs with arbitrary bytes are
             // handled gracefully instead of triggering undefined behaviour.
             let name = String::from_utf8_lossy(&fc.name);
@@
             if fc.has_int {
                 let col_name = if conflict {
                     format!("{}_int", name)
                 } else {
                     name.to_string()
                 };
+                reserve_name(&col_name)?;
                 let mut values = vec![0i64; num_rows];
                 let mut valid = vec![false; num_rows];
                 ...

Apply the same reserve_name(&col_name)? check in the float/string branches as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-arrow/src/streaming_builder.rs` around lines 271 - 324, The
schema-building can emit duplicate output names; ensure you call
reserve_name(&col_name)? before adding any column in the Float and Str branches
(i.e. inside the fc.has_float and fc.has_str blocks, just before
schema_fields.push and arrays.push) the same way the Int branch does, so
duplicate output names (including collisions with the reserved "_raw" when
keep_raw is used) return an error instead of silently hiding arrays; use the
existing reserve_name function and the local col_name variable in those
branches.
crates/logfwd-transform/src/udf/json_extract.rs (1)

242-257: ⚠️ Potential issue | 🟡 Minor

Add a regression for quoted numeric strings on the bare-name path.

The current suite still only covers conflict batches, so this branch never runs against a single-type string column like {"status":"200"} or {"duration":"1.5"}. That leaves the PR’s new “quoted numbers return NULL” behavior unpinned.

💡 Suggested regression cases
+    #[tokio::test]
+    async fn test_json_int_quoted_number_returns_null() {
+        let batch = make_raw_batch(vec![r#"{"status": "200"}"#]);
+        let result = query("SELECT json_int(_raw, 'status') as s FROM logs", batch).await;
+        let col = result
+            .column(0)
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .unwrap();
+        assert!(col.is_null(0));
+    }
+
+    #[tokio::test]
+    async fn test_json_float_quoted_number_returns_null() {
+        let batch = make_raw_batch(vec![r#"{"duration": "1.5"}"#]);
+        let result = query("SELECT json_float(_raw, 'duration') as d FROM logs", batch).await;
+        let col = result
+            .column(0)
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .unwrap();
+        assert!(col.is_null(0));
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/src/udf/json_extract.rs` around lines 242 - 257, Add
regression tests that exercise the bare-name fallback path when the column is a
single-type string so the branches in JsonExtractMode::Int and
JsonExtractMode::Float run; specifically, create inputs like a string column
containing JSON rows {"status":"200"} and {"duration":"1.5"} and assert that
json_extract returns nulls (i.e. matches the code paths that produce
arrow::array::new_null_array(&DataType::Int64, ...) and
new_null_array(&DataType::Float64, ...)). Target the UDF/test that invokes the
JsonExtract logic (so the arr variable, DataType checks, and cast logic for
DataType::Int64/Float64 are covered) to pin the regression for quoted numeric
strings on the bare-name path.
crates/logfwd-core/tests/scanner_conformance.rs (2)

95-148: ⚠️ Potential issue | 🟠 Major

Make the suffixed→bare fallback fail on type mismatches.

Once col_name is chosen, these branches should fail if that column has the wrong array type. Right now they silently skip assertions when a type mismatch occurs, which can hide typing regressions and turn them into false-green conformance runs.

💡 Make the fallback strict
                 let col_name = if batch.column_by_name(&suffixed).is_some() {
                     suffixed
                 } else {
                     key_str.to_string()
                 };
-                if let Some(col) = batch.column_by_name(&col_name)
-                    && let Some(arr) = col.as_any().downcast_ref::<StringArray>()
-                    && !arr.is_null(row)
-                {
+                let col = batch
+                    .column_by_name(&col_name)
+                    .unwrap_or_else(|| panic!("missing expected column '{col_name}'"));
+                let arr = col
+                    .as_any()
+                    .downcast_ref::<StringArray>()
+                    .unwrap_or_else(|| {
+                        panic!("column '{col_name}' has wrong type: {:?}", col.data_type())
+                    });
+                if !arr.is_null(row) {
                     let actual = arr.value(row);
                     ...
                 }

Apply the same pattern to the Int64Array and Float64Array branches as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/tests/scanner_conformance.rs` around lines 95 - 148, The
branches pick col_name by falling back from suffixed to bare but currently
silently skip when the existing column has the wrong array type; change each
branch (StringArray, Int64Array, Float64Array handling around
batch.column_by_name, arr.as_any().downcast_ref and col_name) to first fetch
column by name and then assert/fail if the downcast to the expected array type
returns None (e.g.,
assert!(col.as_any().downcast_ref::<StringArray>().is_some(), "type mismatch for
{col_name} at row {row}: expected StringArray"), instead of silently skipping),
and apply the same strict check to the Int64Array and Float64Array branches so
type mismatches surface as test failures.

318-330: ⚠️ Potential issue | 🟡 Minor

Assert both st_col are string-like before casting to Utf8.

The Int64Array and Float64Array branches use if let Some() on s_col then unwrap() on st_col, establishing type agreement. The string branch only checks s_col's type before casting both columns to Utf8, allowing numeric st_col values to be silently stringified and hide type mismatches. Add an assertion matching the s_col check to catch builder inconsistencies.

Suggested fix
                 if matches!(
                     s_col.data_type(),
                     arrow::datatypes::DataType::Utf8
                         | arrow::datatypes::DataType::Utf8View
                         | arrow::datatypes::DataType::LargeUtf8
                 ) {
+                    assert!(
+                        matches!(
+                            st_col.data_type(),
+                            arrow::datatypes::DataType::Utf8
+                                | arrow::datatypes::DataType::Utf8View
+                                | arrow::datatypes::DataType::LargeUtf8
+                        ),
+                        "builder type mismatch at {col_name}: left={:?}, right={:?}",
+                        s_col.data_type(),
+                        st_col.data_type(),
+                    );
                     let s_val =
                         arrow::compute::cast(s_col, &arrow::datatypes::DataType::Utf8).unwrap();
                     let st_val =
                         arrow::compute::cast(st_col, &arrow::datatypes::DataType::Utf8).unwrap();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/tests/scanner_conformance.rs` around lines 318 - 330, The
string branch currently only checks s_col.data_type() before casting both s_col
and st_col to Utf8, which can mask type mismatches; update the check to assert
that st_col is string-like as well (mirror the pattern used in the
Int64Array/Float64Array branches): perform a type check or an if-let downcast on
s_col (e.g., downcast_ref::<StringArray>()) and likewise assert or use if-let on
st_col (st_col.as_any().downcast_ref::<StringArray>()) before calling
arrow::compute::cast or unwrap, so both s_col and st_col are confirmed
Utf8/string arrays prior to casting.
crates/logfwd/tests/integration.rs (1)

483-507: ⚠️ Potential issue | 🟡 Minor

Stale comments still describe the old suffix-based schema.

The code now uses bare names, but comments still reference {field}_{type} and team_str, which is misleading during test maintenance/debugging.

Proposed comment-only cleanup
-    // CSV columns use plain names (no `_str` suffix); scanner columns use the
-    // `{field}_{type}` convention.  The alias brings the enriched column into
-    // the logfwd naming scheme for downstream compatibility.
+    // Both scanner and CSV columns are addressed via bare names for this
+    // single-type dataset.
@@
-    // The output must contain the `team_str` column from the CSV.
+    // The output must contain the `team` column from the CSV.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/tests/integration.rs` around lines 483 - 507, Update the stale
comments in the SqlTransform test: remove references to the old "{field}_{type}"
convention and "team_str" and instead state that CSV columns use bare names
(e.g. "team") and the SQL alias brings that field into the enriched output;
adjust the comment above the SQL string and the assertion comment that
references schema.field_with_name("team")/result.schema() to reflect the current
bare-name behavior and avoid mentioning suffix-based naming or legacy examples.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-core/tests/compliance_data.rs`:
- Around line 702-703: The test uses misleading expect messages that reference
removed suffixed names; update the expect messages on the get_str/get_* calls
(e.g., the get_str(batch, "items", 0).expect("items_str") and similar get_*
calls like the one for "v_float") to reflect the actual bare lookup keys (e.g.,
expect("items") and expect("v_float")), and make the same change for the other
occurrence at the second location mentioned so the failure text matches the
queried key names; locate calls to get_str/get_* and replace the suffixed
strings in their expect(...) arguments accordingly.

In `@dev-docs/research/type-suffix-redesign.md`:
- Around line 61-73: Update the "AnalyzerRule + TableProvider for type-conflict
batches" section to clearly mark it as a planned design for issue `#625` and add
one sentence describing the current implementation: that code currently
registers a plain MemTable for tables without an AnalyzerRule/TableProvider, and
that conflict batches are not yet routed via AnalyzerRule/TableProvider in the
present implementation (see use of MemTable in the transform code). Use the
exact headings/terms "AnalyzerRule", "TableProvider", "MemTable", and reference
"#625" so readers know this is a future change and how the current behavior
differs.

---

Outside diff comments:
In `@crates/logfwd-arrow/src/scanner.rs`:
- Around line 193-213: The test currently only checks that "host" and "status"
columns exist but not that legacy suffixed names were omitted; update the
single-type fields assertions to also assert absence of the legacy suffixed
names by checking batch.column_by_name("host_str").is_none() and
batch.column_by_name("status_int").is_none() (or equivalent asserts), so the
test fails if the scanner emits both the new names and the old suffixed
variants; apply this change in the single-type fields block around the existing
asserts that reference batch and column_by_name.

In `@crates/logfwd-arrow/src/streaming_builder.rs`:
- Around line 271-324: The schema-building can emit duplicate output names;
ensure you call reserve_name(&col_name)? before adding any column in the Float
and Str branches (i.e. inside the fc.has_float and fc.has_str blocks, just
before schema_fields.push and arrays.push) the same way the Int branch does, so
duplicate output names (including collisions with the reserved "_raw" when
keep_raw is used) return an error instead of silently hiding arrays; use the
existing reserve_name function and the local col_name variable in those
branches.

In `@crates/logfwd-core/tests/scanner_conformance.rs`:
- Around line 95-148: The branches pick col_name by falling back from suffixed
to bare but currently silently skip when the existing column has the wrong array
type; change each branch (StringArray, Int64Array, Float64Array handling around
batch.column_by_name, arr.as_any().downcast_ref and col_name) to first fetch
column by name and then assert/fail if the downcast to the expected array type
returns None (e.g.,
assert!(col.as_any().downcast_ref::<StringArray>().is_some(), "type mismatch for
{col_name} at row {row}: expected StringArray"), instead of silently skipping),
and apply the same strict check to the Int64Array and Float64Array branches so
type mismatches surface as test failures.
- Around line 318-330: The string branch currently only checks s_col.data_type()
before casting both s_col and st_col to Utf8, which can mask type mismatches;
update the check to assert that st_col is string-like as well (mirror the
pattern used in the Int64Array/Float64Array branches): perform a type check or
an if-let downcast on s_col (e.g., downcast_ref::<StringArray>()) and likewise
assert or use if-let on st_col (st_col.as_any().downcast_ref::<StringArray>())
before calling arrow::compute::cast or unwrap, so both s_col and st_col are
confirmed Utf8/string arrays prior to casting.

In `@crates/logfwd-transform/src/udf/json_extract.rs`:
- Around line 242-257: Add regression tests that exercise the bare-name fallback
path when the column is a single-type string so the branches in
JsonExtractMode::Int and JsonExtractMode::Float run; specifically, create inputs
like a string column containing JSON rows {"status":"200"} and
{"duration":"1.5"} and assert that json_extract returns nulls (i.e. matches the
code paths that produce arrow::array::new_null_array(&DataType::Int64, ...) and
new_null_array(&DataType::Float64, ...)). Target the UDF/test that invokes the
JsonExtract logic (so the arr variable, DataType checks, and cast logic for
DataType::Int64/Float64 are covered) to pin the regression for quoted numeric
strings on the bare-name path.

In `@crates/logfwd/tests/integration.rs`:
- Around line 483-507: Update the stale comments in the SqlTransform test:
remove references to the old "{field}_{type}" convention and "team_str" and
instead state that CSV columns use bare names (e.g. "team") and the SQL alias
brings that field into the enriched output; adjust the comment above the SQL
string and the assertion comment that references
schema.field_with_name("team")/result.schema() to reflect the current bare-name
behavior and avoid mentioning suffix-based naming or legacy examples.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6c66182f-15c2-4a97-982e-30884a6cec9b

📥 Commits

Reviewing files that changed from the base of the PR and between 3865eca and b6159ca.

📒 Files selected for processing (15)
  • .github/workflows/ci.yml
  • crates/logfwd-arrow/src/scanner.rs
  • crates/logfwd-arrow/src/storage_builder.rs
  • crates/logfwd-arrow/src/streaming_builder.rs
  • crates/logfwd-core/tests/compliance_data.rs
  • crates/logfwd-core/tests/scanner_conformance.rs
  • crates/logfwd-transform/src/lib.rs
  • crates/logfwd-transform/src/rewriter.rs
  • crates/logfwd-transform/src/udf/json_extract.rs
  • crates/logfwd-transform/tests/raw_first_bench.rs
  • crates/logfwd-transform/tests/scanner_datafusion_boundary.rs
  • crates/logfwd/src/pipeline.rs
  • crates/logfwd/tests/compliance.rs
  • crates/logfwd/tests/integration.rs
  • dev-docs/research/type-suffix-redesign.md
💤 Files with no reviewable changes (2)
  • crates/logfwd-transform/src/lib.rs
  • crates/logfwd-transform/src/rewriter.rs

Comment thread crates/logfwd-core/tests/compliance_data.rs Outdated
Comment thread dev-docs/research/type-suffix-redesign.md Outdated
strawgate added a commit that referenced this pull request Apr 2, 2026
- json_extract: add regression tests for json_int/json_float returning
  null when the field is a quoted string (bare Utf8 column, no conflict)
- scanner_conformance oracle: panic on type mismatch instead of silently
  skipping assertions; fall back bare→suffixed column lookup so single-
  type fields are actually verified
- scanner_conformance oracle: accept Int64 column when checking a float
  value (e.g. -0 has no decimal/exponent so scanner emits Int64)
- assert_builders_consistent: assert st_col is also string-typed before
  casting to Utf8 for value comparison

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/logfwd-transform/tests/scanner_datafusion_boundary.rs (1)

570-590: 🧹 Nitpick | 🔵 Trivial

Assert the full ordered result here.

This now only checks the total row count and the ERROR bucket, so it would still pass if ORDER BY cnt DESC, level ASC were wrong for INFO or DEBUG. Please assert the full row order for this path.

✅ Stronger assertion
-    // ERROR×2 comes first (cnt DESC), then DEBUG×1 and INFO×2 tie — but INFO
-    // and ERROR both have 2, and ERROR sorts before INFO alphabetically when
-    // cnt is equal.  Let's just verify the total count.
-    let total: i64 = counts.iter().map(|v| v.unwrap_or(0)).sum();
-    assert_eq!(total, 5);
-    // ERROR must appear with count 2.
-    let error_pos = levels.iter().position(|v| v == "ERROR").unwrap();
-    assert_eq!(counts[error_pos], Some(2));
+    assert_eq!(levels, ["ERROR", "INFO", "DEBUG"]);
+    assert_eq!(counts, [Some(2), Some(2), Some(1)]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/tests/scanner_datafusion_boundary.rs` around lines
570 - 590, The test currently only checks total and ERROR bucket; instead assert
the full ordered result returned by
SqlTransform::new(...).execute_blocking(batch) by comparing the collected levels
and counts (from collect_string_col and collect_i64_col) to the expected ordered
vectors: levels should equal ["ERROR","INFO","DEBUG"] and counts should equal
[Some(2), Some(2), Some(1)] (or the appropriate Option<i64> form used in the
test); update the assertions to compare these exact vectors so ORDER BY cnt
DESC, level ASC is fully validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-transform/src/conflict_schema.rs`:
- Around line 64-77: The grouping logic using strip_conflict_suffix and groups
currently treats any shared suffixes as a conflict and may synthesize a bare
column for literal sibling keys; change the discriminator so groups only survive
if (a) the suffixes are in a vetted set of known conflict suffixes (e.g. a
KNOWN_CONFLICT_SUFFIXES list used by strip_conflict_suffix) AND (b) the member
fields in schema.fields() show conflicting representations (e.g. differing data
types or explicit scanner-emitted conflict metadata on the Field/Schema) —
otherwise drop the group; update the groups.retain call to perform these checks
(and reference existing_bare, strip_conflict_suffix, and the per-field
type/metadata) and add a regression test exercising literal siblings vs real
conflict variants.
- Around line 83-108: The appended synthesized columns come from `groups` (a
HashMap) and are added in arbitrary order causing schema flapping; convert
`groups` into a deterministic Vec sorted by the minimum source-column index of
each group's `members` before the for-loop that builds
`extra_fields`/`extra_arrays` (the code that calls `merge_to_utf8`, pushes to
`extra_fields` and `extra_arrays`, and uses `batch.num_rows()`); compute the min
index for each `(base, members)` entry, sort by that min, then iterate the
sorted vector so the order of pushed fields/arrays (and thus the resulting
`new_schema`) is stable across runs.

In `@crates/logfwd-transform/src/lib.rs`:
- Around line 598-603: The code currently calls
conflict_schema::normalize_conflict_columns(batch) which mutates the physical
RecordBatch (the `batch` used to construct the MemTable), causing synthetic bare
columns to become part of `logs` and breaking SELECT * / wildcard round-trips;
instead stop normalizing the RecordBatch before MemTable registration and move
the bare-name aliasing into the query planning/projection stage (so the MemTable
stores the original physical schema). Concretely: revert removal of columns from
the RecordBatch (undo the call to conflict_schema::normalize_conflict_columns
when building the MemTable), keep registering the original `batch` in the
MemTable, and implement the synthetic `status: Utf8` aliasing inside the
planner/projection codepath that prepares projections for execution (referencing
the same conflict_schema logic but applied to projection expressions rather than
mutating `batch`).

In `@crates/logfwd-transform/src/udf/json_extract.rs`:
- Around line 59-68: The suffix_order() mapping is fine but the lookup logic
that uses it (e.g., in the json(...) and json_float(...) extraction paths) must
not pick the first existing column at batch-level; instead implement row-wise
coalescing across the ordered suffix variants: for each field, generate an
expression that checks each variant column in suffix_order(self) in preference
order and returns the first non-null/valid per row (e.g., COALESCE or
conditional pick with IS NOT NULL and appropriate casting/parsing for
_int/_float/_str), rather than selecting a single column for the whole batch;
update the codepaths that currently short-circuit on the first matching column
(referencing functions json, json_float, and the suffix_order method) to build
per-row coalesce logic so mixed-type batches yield per-row values correctly.

---

Outside diff comments:
In `@crates/logfwd-transform/tests/scanner_datafusion_boundary.rs`:
- Around line 570-590: The test currently only checks total and ERROR bucket;
instead assert the full ordered result returned by
SqlTransform::new(...).execute_blocking(batch) by comparing the collected levels
and counts (from collect_string_col and collect_i64_col) to the expected ordered
vectors: levels should equal ["ERROR","INFO","DEBUG"] and counts should equal
[Some(2), Some(2), Some(1)] (or the appropriate Option<i64> form used in the
test); update the assertions to compare these exact vectors so ORDER BY cnt
DESC, level ASC is fully validated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 01e14670-2d66-46b1-b4d9-2594531f03b8

📥 Commits

Reviewing files that changed from the base of the PR and between 7e63540 and 17bee59.

📒 Files selected for processing (5)
  • crates/logfwd-core/src/framer.rs
  • crates/logfwd-transform/src/conflict_schema.rs
  • crates/logfwd-transform/src/lib.rs
  • crates/logfwd-transform/src/udf/json_extract.rs
  • crates/logfwd-transform/tests/scanner_datafusion_boundary.rs

Comment thread crates/logfwd-transform/src/conflict_schema.rs Outdated
Comment on lines +83 to +108
let mut extra_fields: Vec<Field> = Vec::with_capacity(groups.len());
let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(groups.len());

for (base, members) in &groups {
let int_col = members
.iter()
.find(|(s, _)| *s == "_int")
.map(|(_, idx)| batch.column(*idx).as_ref());
let float_col = members
.iter()
.find(|(s, _)| *s == "_float")
.map(|(_, idx)| batch.column(*idx).as_ref());
let str_col = members
.iter()
.find(|(s, _)| *s == "_str")
.map(|(_, idx)| batch.column(*idx).as_ref());

let merged = merge_to_utf8(int_col, float_col, str_col, batch.num_rows());
extra_fields.push(Field::new(*base, DataType::Utf8, true));
extra_arrays.push(merged);
}

// Append the computed bare columns to the existing schema and arrays.
let mut fields: Vec<Field> = schema.fields().iter().map(|f| (**f).clone()).collect();
fields.extend(extra_fields);
let new_schema = Arc::new(Schema::new(fields));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file structure and context
head -120 crates/logfwd-transform/src/conflict_schema.rs | tail -50

Repository: strawgate/memagent

Length of output: 1865


🏁 Script executed:

# Check what `groups` is - need to see the full function context
sed -n '50,120p' crates/logfwd-transform/src/conflict_schema.rs

Repository: strawgate/memagent

Length of output: 3007


🏁 Script executed:

# Check the coding style documentation for error handling
cat dev-docs/CODE_STYLE.md | grep -A 10 -B 2 "unwrap\|error\|production"

Repository: strawgate/memagent

Length of output: 1505


🏁 Script executed:

# Look for HashMap imports and confirm the type of `groups`
rg "HashMap|groups" crates/logfwd-transform/src/conflict_schema.rs -B 5 -A 2

Repository: strawgate/memagent

Length of output: 2402


Stabilize synthesized-column order to prevent schema flapping.

groups is a HashMap, so appended bare columns come out in arbitrary order. With multiple conflict groups, this causes SELECT * and schema-order-sensitive downstream consumers to produce different results between runs. Sort by the minimum source-column index within each group before appending.

♻️ One way to stabilize the append order
-    let mut extra_fields: Vec<Field> = Vec::with_capacity(groups.len());
-    let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(groups.len());
-
-    for (base, members) in &groups {
+    let mut ordered_groups: Vec<_> = groups.iter().collect();
+    ordered_groups.sort_by_key(|(_, members)| {
+        members
+            .iter()
+            .map(|(_, idx)| *idx)
+            .min()
+            .unwrap_or(usize::MAX)
+    });
+
+    let mut extra_fields: Vec<Field> = Vec::with_capacity(ordered_groups.len());
+    let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(ordered_groups.len());
+
+    for (base, members) in ordered_groups {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let mut extra_fields: Vec<Field> = Vec::with_capacity(groups.len());
let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(groups.len());
for (base, members) in &groups {
let int_col = members
.iter()
.find(|(s, _)| *s == "_int")
.map(|(_, idx)| batch.column(*idx).as_ref());
let float_col = members
.iter()
.find(|(s, _)| *s == "_float")
.map(|(_, idx)| batch.column(*idx).as_ref());
let str_col = members
.iter()
.find(|(s, _)| *s == "_str")
.map(|(_, idx)| batch.column(*idx).as_ref());
let merged = merge_to_utf8(int_col, float_col, str_col, batch.num_rows());
extra_fields.push(Field::new(*base, DataType::Utf8, true));
extra_arrays.push(merged);
}
// Append the computed bare columns to the existing schema and arrays.
let mut fields: Vec<Field> = schema.fields().iter().map(|f| (**f).clone()).collect();
fields.extend(extra_fields);
let new_schema = Arc::new(Schema::new(fields));
let mut ordered_groups: Vec<_> = groups.iter().collect();
ordered_groups.sort_by_key(|(_, members)| {
members
.iter()
.map(|(_, idx)| *idx)
.min()
.unwrap_or(usize::MAX)
});
let mut extra_fields: Vec<Field> = Vec::with_capacity(ordered_groups.len());
let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(ordered_groups.len());
for (base, members) in ordered_groups {
let int_col = members
.iter()
.find(|(s, _)| *s == "_int")
.map(|(_, idx)| batch.column(*idx).as_ref());
let float_col = members
.iter()
.find(|(s, _)| *s == "_float")
.map(|(_, idx)| batch.column(*idx).as_ref());
let str_col = members
.iter()
.find(|(s, _)| *s == "_str")
.map(|(_, idx)| batch.column(*idx).as_ref());
let merged = merge_to_utf8(int_col, float_col, str_col, batch.num_rows());
extra_fields.push(Field::new(*base, DataType::Utf8, true));
extra_arrays.push(merged);
}
// Append the computed bare columns to the existing schema and arrays.
let mut fields: Vec<Field> = schema.fields().iter().map(|f| (**f).clone()).collect();
fields.extend(extra_fields);
let new_schema = Arc::new(Schema::new(fields));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/src/conflict_schema.rs` around lines 83 - 108, The
appended synthesized columns come from `groups` (a HashMap) and are added in
arbitrary order causing schema flapping; convert `groups` into a deterministic
Vec sorted by the minimum source-column index of each group's `members` before
the for-loop that builds `extra_fields`/`extra_arrays` (the code that calls
`merge_to_utf8`, pushes to `extra_fields` and `extra_arrays`, and uses
`batch.num_rows()`); compute the min index for each `(base, members)` entry,
sort by that min, then iterate the sorted vector so the order of pushed
fields/arrays (and thus the resulting `new_schema`) is stable across runs.

Comment on lines +598 to +603
//
// Normalize the batch first: if the scanner detected type conflicts it
// emits suffixed columns (`status_int`, `status_str`). Add a bare
// `status: Utf8` column so SQL using bare names resolves on both clean
// and conflict batches.
let batch = conflict_schema::normalize_conflict_columns(batch);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

This mutates the physical schema seen by SELECT *.

Normalizing before MemTable registration makes the synthetic bare columns part of logs, so conflict batches now expose both the typed variants and the synthetic view column. That breaks round-trip / wildcard semantics; the bare-name aliasing needs to happen in planning or projection, not by altering the input batch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/src/lib.rs` around lines 598 - 603, The code
currently calls conflict_schema::normalize_conflict_columns(batch) which mutates
the physical RecordBatch (the `batch` used to construct the MemTable), causing
synthetic bare columns to become part of `logs` and breaking SELECT * / wildcard
round-trips; instead stop normalizing the RecordBatch before MemTable
registration and move the bare-name aliasing into the query planning/projection
stage (so the MemTable stores the original physical schema). Concretely: revert
removal of columns from the RecordBatch (undo the call to
conflict_schema::normalize_conflict_columns when building the MemTable), keep
registering the original `batch` in the MemTable, and implement the synthetic
`status: Utf8` aliasing inside the planner/projection codepath that prepares
projections for execution (referencing the same conflict_schema logic but
applied to projection expressions rather than mutating `batch`).

Comment thread crates/logfwd-transform/src/udf/json_extract.rs
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
dev-docs/research/type-suffix-redesign.md (1)

194-200: ⚠️ Potential issue | 🟠 Major

Mark schema padding as planned, not current behavior.

The PR objectives for #684 explicitly leave schema padding to #625, but this section reads as if missing SQL-referenced columns are already padded on every batch. Please move this to a planned/future note and separate it from what the current PR actually implements.

📝 Suggested wording
-At config time, `QueryAnalyzer` extracts `referenced_columns` from the
-user's SQL. Before registering each batch as a MemTable, pad with null
-columns for any SQL-referenced columns missing from the batch.
+Planned (`#625`): at config time, `QueryAnalyzer` will extract
+`referenced_columns` from the user's SQL and pad missing
+SQL-referenced columns with nulls.
+
+Current PR scope stops earlier: conflict batches are normalized for
+bare-name resolution, but cross-batch padding is still future work.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@dev-docs/research/type-suffix-redesign.md` around lines 194 - 200, The doc
currently states that QueryAnalyzer extracts referenced_columns and that batches
are padded with nulls and normalize_conflict_columns synthesizes columns; update
this text to clearly mark schema padding as planned work (to be handled by issue
`#625`) rather than current behavior: change language around QueryAnalyzer,
MemTable batch padding, and normalize_conflict_columns to indicate these are
future/planned features, separate them from the implemented behavior for
conflict batches, and add a short note linking the planned behavior to the
corresponding issue number so readers understand it is not yet implemented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-core/tests/scanner_conformance.rs`:
- Around line 99-102: The test currently silently skips validation when neither
the suffixed nor bare column exists because col is set via .or_else(||
batch.column_by_name(...)) and then matched with if let Some(col), which masks
missing columns; change this to eagerly fail by replacing that lookup with
expect or unwrap_or_else (e.g., let col =
batch.column_by_name(&format!("{key_str}__str")).or_else(||
batch.column_by_name(key_str)).expect(&format!("missing column for key: {}",
key_str));) so a missing __str/__int/__float or bare column panics the test
immediately; apply the same change for the other two occurrences that resolve
columns for __int and __float as well.
- Around line 164-171: The Int64 fallback in the test is using a too-large
tolerance `(actual - expected).abs() < 1.0`; tighten this by asserting exact
equality for integer-derived floats or a tiny epsilon instead. Update the
assertion in the Int64Array branch (where `actual = arr.value(row) as f64` and
`expected` are compared) to either use `actual == expected` for semantically
integral spellings or `(actual - expected).abs() < EPSILON` with a small epsilon
(e.g., 1e-12 or f64::EPSILON) so that values like `1.9` won't be mistaken for
`1`. Ensure the assertion message remains helpful and references `key_str` and
`row`.

In `@dev-docs/research/column-type-constraints.md`:
- Around line 112-117: The current wording in "Config declares type hints;
reader infers by default." incorrectly asserts the default path satisfies C3;
update the text to state the reader infers types by default and that this
default behavior satisfies C1 (per-batch correctness) only, and explicitly note
that C3 (cross-batch schema stability) and C8 are guaranteed only when a field
is pinned via config (e.g., schema: { status: int }) or after implementing
additional schema-stability work; keep the example of schema pinning and remove
the claim that C3 is satisfied by the default data-driven path.

In `@dev-docs/research/type-suffix-redesign.md`:
- Around line 204-223: Update the Phase 10/10b/10c roadmap text to reflect PR
`#684` as implemented: mark Phase 10 as complete (change “to be updated” to done),
remove or move Phase 10b from “future work” to completed changes noting the
double-underscore rename in StreamingBuilder and StorageBuilder and the updates
to strip_conflict_suffix (conflict_schema.rs) and suffix_order
(json_extract.rs), and state that logfwd.conflict_groups schema metadata
stamping was added; also update Phase 10c to declare ConflictGroups and
TypedValue added to logfwd-output and note that OTLP, JSON Lines/TCP/UDP, and
stdout sinks now preserve types per-row and that tests for conflict batch
round-trips were added.

---

Duplicate comments:
In `@dev-docs/research/type-suffix-redesign.md`:
- Around line 194-200: The doc currently states that QueryAnalyzer extracts
referenced_columns and that batches are padded with nulls and
normalize_conflict_columns synthesizes columns; update this text to clearly mark
schema padding as planned work (to be handled by issue `#625`) rather than current
behavior: change language around QueryAnalyzer, MemTable batch padding, and
normalize_conflict_columns to indicate these are future/planned features,
separate them from the implemented behavior for conflict batches, and add a
short note linking the planned behavior to the corresponding issue number so
readers understand it is not yet implemented.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ead96112-98f0-46db-b525-6cc7be49e901

📥 Commits

Reviewing files that changed from the base of the PR and between 17bee59 and 9db315b.

📒 Files selected for processing (11)
  • crates/logfwd-arrow/src/lib.rs
  • crates/logfwd-arrow/src/scanner.rs
  • crates/logfwd-arrow/src/storage_builder.rs
  • crates/logfwd-arrow/src/streaming_builder.rs
  • crates/logfwd-core/tests/compliance_data.rs
  • crates/logfwd-core/tests/scanner_conformance.rs
  • crates/logfwd-transform/src/conflict_schema.rs
  • crates/logfwd-transform/src/udf/json_extract.rs
  • crates/logfwd-transform/tests/scanner_datafusion_boundary.rs
  • dev-docs/research/column-type-constraints.md
  • dev-docs/research/type-suffix-redesign.md

Comment thread crates/logfwd-core/tests/scanner_conformance.rs Outdated
Comment thread crates/logfwd-core/tests/scanner_conformance.rs Outdated
Comment thread dev-docs/research/column-type-constraints.md Outdated
Comment thread dev-docs/research/type-suffix-redesign.md
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/logfwd-transform/src/lib.rs (1)

275-292: ⚠️ Potential issue | 🟠 Major

Don't strip real __* field names during pushdown.

Line 285 still treats any foo__str / foo__int / foo__float reference as a conflict variant. That breaks legitimate JSON keys with those endings: SELECT error__str FROM logs makes scan_config() request error, so field pushdown drops the actual column. Strip only names that are known conflict groups, for example from logfwd.conflict_groups, instead of matching on the suffix alone.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/src/lib.rs` around lines 275 - 292, strip_type_suffix
currently strips any name ending in "__str"/"__int"/"__float" which removes
legitimate JSON keys; change it to only strip when the base name is a known
conflict group. Modify strip_type_suffix (or its caller) to consult the
configured conflict groups (e.g. logfwd.conflict_groups) and only return the
base when the base appears in that set of conflict group names (otherwise return
the original name unchanged); reference the function name strip_type_suffix and
the conflict groups configuration (logfwd.conflict_groups) so the check is
implemented against the canonical list rather than just matching suffixes.
♻️ Duplicate comments (1)
crates/logfwd-transform/src/lib.rs (1)

607-611: ⚠️ Potential issue | 🟠 Major

Keep the physical logs schema untouched.

Line 611 still feeds a normalized batch into the MemTable, so conflict batches expose both the typed variants and the synthetic bare alias through SELECT *. That breaks the round-trip / wildcard semantics this PR is trying to preserve. Apply bare-name aliasing in planning/projection instead of mutating the RecordBatch before table registration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/src/lib.rs` around lines 607 - 611, The code
currently mutates the physical RecordBatch by calling
conflict_schema::normalize_conflict_columns(batch) and then registering that
normalized batch in the MemTable, which exposes both typed variants and the
synthetic bare alias to SELECT *; revert this by keeping the original batch
intact for table registration (register the unmodified batch with MemTable) and
move the bare-name aliasing logic out of
conflict_schema::normalize_conflict_columns into the query planning/projection
stage so alias columns are synthesized at plan-time (not in the RecordBatch),
e.g., stop passing the normalized batch into MemTable and instead apply the
aliasing during projection/plan construction where the planner can map bare
names to typed variants without mutating the stored batch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@crates/logfwd-transform/src/lib.rs`:
- Around line 275-292: strip_type_suffix currently strips any name ending in
"__str"/"__int"/"__float" which removes legitimate JSON keys; change it to only
strip when the base name is a known conflict group. Modify strip_type_suffix (or
its caller) to consult the configured conflict groups (e.g.
logfwd.conflict_groups) and only return the base when the base appears in that
set of conflict group names (otherwise return the original name unchanged);
reference the function name strip_type_suffix and the conflict groups
configuration (logfwd.conflict_groups) so the check is implemented against the
canonical list rather than just matching suffixes.

---

Duplicate comments:
In `@crates/logfwd-transform/src/lib.rs`:
- Around line 607-611: The code currently mutates the physical RecordBatch by
calling conflict_schema::normalize_conflict_columns(batch) and then registering
that normalized batch in the MemTable, which exposes both typed variants and the
synthetic bare alias to SELECT *; revert this by keeping the original batch
intact for table registration (register the unmodified batch with MemTable) and
move the bare-name aliasing logic out of
conflict_schema::normalize_conflict_columns into the query planning/projection
stage so alias columns are synthesized at plan-time (not in the RecordBatch),
e.g., stop passing the normalized batch into MemTable and instead apply the
aliasing during projection/plan construction where the planner can map bare
names to typed variants without mutating the stored batch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 87960c00-6704-4c81-9120-fe564d66251a

📥 Commits

Reviewing files that changed from the base of the PR and between 9db315b and 8dc9354.

📒 Files selected for processing (2)
  • crates/logfwd-transform/src/lib.rs
  • crates/logfwd-transform/tests/scanner_datafusion_boundary.rs

strawgate and others added 7 commits April 2, 2026 17:26
…#445)

Single-type fields now use bare column names (`status`, `level`) with their
native Arrow type. Suffixed names (`status_int`, `status_str`) only appear
when the same field has multiple types within a single batch.

Changes:
- StreamingBuilder + StorageBuilder: conflict detection in finish_batch()
- Delete rewriter.rs (775 lines of dead SQL text rewriter, never wired in)
- json_extract UDF: suffix_order tries bare name as fallback; non-numeric
  columns return null instead of coercing strings to numbers
- All tests updated to expect bare names for single-type fields
- CI: set RUSTC_WRAPPER="" so cargo test works without sccache

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- json_extract: add regression tests for json_int/json_float returning
  null when the field is a quoted string (bare Utf8 column, no conflict)
- scanner_conformance oracle: panic on type mismatch instead of silently
  skipping assertions; fall back bare→suffixed column lookup so single-
  type fields are actually verified
- scanner_conformance oracle: accept Int64 column when checking a float
  value (e.g. -0 has no decimal/exponent so scanner emits Int64)
- assert_builders_consistent: assert st_col is also string-typed before
  casting to Utf8 for value comparison

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atches (#625)

When the scanner detects a type conflict for a field (e.g. `status` appears
as both int and string across rows), it emits `status_int: Int64` and
`status_str: Utf8View`. SQL using the bare name `status` would fail to resolve.

Add `normalize_conflict_columns()` in the new `conflict_schema` module:
- Detects conflict groups: ≥2 suffixed variants of the same base name with
  no existing bare column
- Adds a computed `status: Utf8` column via COALESCE(int→str, float→str, str)
- Single lone `foo_str` columns (field literally named `foo_str`) are NOT
  treated as conflicts (require ≥2 variants)

Wire into `SqlTransform::execute()` before MemTable registration so every
batch is normalized before DataFusion sees it.

After this change `SELECT status FROM logs` works on both clean batches
(bare `status: Int64` from scanner) and conflict batches (synthesized
`status: Utf8`). Users call `int(status)` / `float(status)` for numeric ops.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…, ConflictGroups

- type-suffix-redesign.md: document __ suffix convention, logfwd.conflict_groups
  metadata key, ConflictGroups/TypedValue output abstraction, implementation phases
- column-type-constraints.md: answer all 6 open questions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ct_groups metadata

Rename conflict column suffixes from single-underscore (_int/_str/_float) to
double-underscore (__int/__str/__float) so real field names like `status_int`
cannot collide with synthesized conflict columns.

Stamp `logfwd.conflict_groups` Arrow schema metadata key when builders detect
type conflicts (format: "status:int,str;duration:float,int"). Zero overhead when
no conflicts in the batch.

Updates:
- storage_builder + streaming_builder: emit __int/__float/__str on conflict,
  accumulate conflict_meta, attach CONFLICT_GROUPS_METADATA_KEY to schema
- lib.rs: re-export CONFLICT_GROUPS_METADATA_KEY
- conflict_schema.rs: CONFLICT_SUFFIXES → __ prefix, HashMap → BTreeMap for
  deterministic ordering, preserve schema metadata in normalize_conflict_columns
- json_extract.rs: suffix_order updated for __ prefixes + bare-name fallback
- All tests updated: compliance_data, scanner_conformance, scanner_datafusion_boundary
  (Section 5 conflict tests: 4 new tests for bare-name SQL on conflict batches)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… test fixtures

The strip_type_suffix helper was stripping single-underscore suffixes (_str,
_int, _float), which caused two bugs after Phase 10b:

1. Real JSON field names like `start_int` would be incorrectly mapped to
   `start` for scanner pushdown — the exact ambiguity Phase 10b was meant
   to eliminate.
2. `strip_type_suffix("status__int")` returned `"status_"` (strips trailing
   `_int`, leaves a dangling underscore), giving the scanner a garbage key.

Fix: strip __str/__int/__float (double underscore) only. For bare names (the
normal post-Phase-10 case) the function is now a no-op, which is correct.

Also:
- Update stale comments in lib.rs (line 106 and 600) and the Section 5 block
  in scanner_datafusion_boundary.rs to reference __ suffixes consistently.
- Update make_test_batch() and all 14 dependent unit tests to use bare column
  names (level, msg, status, host, region, val, n) instead of the old
  level_str/msg_str/… convention, matching what the Phase-10 scanner emits.
- test_filter_hints_typed_column_stripped now tests severity__int rather than
  severity_int, exercising the actual double-underscore strip path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- scanner: assert legacy _str/_int suffixed columns absent in single-type test
- streaming_builder + storage_builder: detect duplicate output column names (reserve_name guard)
- scanner_conformance: panic on missing scanner columns and wrong types instead of silently skipping
- integration: remove stale {field}_{type} and team_str comment references

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@strawgate strawgate force-pushed the feat/suffix-on-conflict-445 branch from d3c7854 to deddbe3 Compare April 2, 2026 22:37
Commit 085c2cf changed BatchMetadata.resource_attrs to Arc<Vec<...>> but
didn't update the test constructors, breaking compilation on Linux/macOS CI.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@strawgate
Copy link
Copy Markdown
Owner Author

@coderabbitai please do a very thorough review

Same 085c2cf regression — missed two more callsites in
elasticsearch_arrow_ipc.rs and elasticsearch_arrow.rs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 2, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read key changed files for thorough review

echo "=== storage_builder.rs ==="
cat crates/logfwd-arrow/src/storage_builder.rs

echo ""
echo "=== streaming_builder.rs ==="
cat crates/logfwd-arrow/src/streaming_builder.rs

echo ""
echo "=== conflict_schema.rs ==="
cat crates/logfwd-transform/src/conflict_schema.rs

echo ""
echo "=== json_extract.rs ==="
cat crates/logfwd-transform/src/udf/json_extract.rs

echo ""
echo "=== logfwd-transform/src/lib.rs ==="
cat crates/logfwd-transform/src/lib.rs

echo ""
echo "=== scanner.rs (relevant tests) ==="
grep -n "type_conflict\|__int\|__str\|__float\|bare\|suffix" crates/logfwd-arrow/src/scanner.rs | head -80

echo ""
echo "=== COLUMN_NAMING.md ==="
fd -i "COLUMN_NAMING.md" --exec cat {}

echo ""
echo "=== docs/type-suffix-redesign.md ==="
fd -i "type-suffix-redesign.md" --exec cat {}

echo ""
echo "=== scanner_conformance.rs ==="
cat crates/logfwd-core/tests/scanner_conformance.rs

echo ""
echo "=== lib.rs for logfwd-arrow ==="
cat crates/logfwd-arrow/src/lib.rs

echo ""
echo "=== Check for any remaining _int/_str/_float single-underscore suffix references ==="
rg -n "_int\b|_str\b|_float\b" --type rust -g "!*test*" -g "!*spec*" crates/ | grep -v "//\|#\[" | head -60

echo ""
echo "=== Check strip_type_suffix usage ==="
rg -n "strip_type_suffix" --type rust -A5 -B2 crates/

echo ""
echo "=== Check CONFLICT_GROUPS_METADATA_KEY usage ==="
rg -n "CONFLICT_GROUPS_METADATA_KEY\|conflict_groups" --type rust crates/

echo ""
echo "=== Check for any Kani proof updates ==="
rg -n "kani" --type rust crates/ | grep -v "test\|spec\|//.*kani" | head -30

echo ""
echo "=== Check suffix_order in json_extract ==="
grep -n "suffix_order\|bare\|fallback\|__int\|__str\|__float" crates/logfwd-transform/src/udf/json_extract.rs

echo ""
echo "=== Check for DESIGN.md (ADR requirement from learnings) ==="
fd -i "DESIGN.md" --exec cat {}

echo ""
echo "=== scanner_datafusion_boundary.rs new tests ==="
cat crates/logfwd-transform/tests/scanner_datafusion_boundary.rs

Length of output: 235551


@strawgate Here's a thorough review of PR #684.


Summary

The core design (bare names for single-type, __-suffixed for conflict, normalize_conflict_columns synthesis, deletion of rewriter.rs) is sound and the implementation is clean. The builder deferred-write pattern is preserved correctly per long-term memory. There are several issues worth addressing before merge, ranging from a stale documentation file (which violates a stored guideline) to a correctness gap in conflict detection.


🔴 Critical

1. COLUMN_NAMING.md is not updated for the new double-underscore format

The stored guideline (ID 321c958a) says COLUMN_NAMING.md must be updated when the type-conflict suffix format changes. The current file still documents the old single-underscore format:

# Current COLUMN_NAMING.md (stale):
| status_int  | Int64 | Rows where status was an integer |
| status_str  | Utf8  | Rows where status was a string   |

# Type suffix table still shows:
| _str | String … | Utf8 / Utf8View |
| _int | Integer  | Int64           |

This needs to be updated to __int/__str/__float with the collision-avoidance rationale explained in the design doc.


🟠 High

2. normalize_conflict_columns ignores logfwd.conflict_groups metadata — re-derives by name scanning instead

The design document explicitly states:

normalize_conflict_columns() in logfwd-transform reads the logfwd.conflict_groups metadata and adds a synthesized bare column

However, the actual implementation in conflict_schema.rs completely ignores that metadata key and instead re-discovers conflict groups by scanning all column names for __int/__str/__float suffixes:

// conflict_schema.rs — ignores CONFLICT_GROUPS_METADATA_KEY:
for (idx, field) in schema.fields().iter().enumerate() {
    if let Some((base, suf)) = strip_conflict_suffix(field.name()) {
        ...
    }
}

This creates a false positive risk: a user whose JSON documents contain keys literally named foo__int and foo__str (double-underscore is unusual but valid) will have a spurious foo: Utf8 column synthesized — even though the builders never flagged a conflict for foo. The metadata key is the authoritative source; the name-scanning approach is fragile.

The fix is to read CONFLICT_GROUPS_METADATA_KEY from the schema metadata first, and only process bases listed there:

let meta_str = schema.metadata().get(CONFLICT_GROUPS_METADATA_KEY);
if meta_str.is_none() {
    return batch; // fast path — no conflict groups recorded
}
// parse "status:int,str;duration:float,int" to get authoritative base names

3. scanner_datafusion_boundary.rs test helpers use the old single-underscore naming convention

The new test helpers in Section 1 still use the pre-Phase-10 naming:

// make_utf8view_batch() — still old style:
Field::new("level_str", DataType::Utf8View, true),   // should be "level"
Field::new("msg_str", DataType::Utf8View, true),      // should be "msg"
Field::new("count_int", DataType::Int64, true),        // should be "count" or just "count"

And the SQL in those tests queries level_str, msg_str, etc.:

"SELECT * FROM logs WHERE level_str = 'ERROR'"

These helpers test DataFusion's ability to handle Utf8View columns, but they're using pre-Phase-10 suffixed names, which contradicts the entire premise of this PR. The tests should use bare names (level, msg, count) to be consistent with the new convention — and to actually validate that bare-name SQL works end-to-end.

4. Phase doc (type-suffix-redesign.md) has a self-contradictory Phase 10 description

The Implementation Phases section says:

### Phase 10 (complete — PR `#684`)
- Builders emit bare names for single-type fields, `_int`/`_str`/`_float`
  for conflicts (single underscore — to be updated)       ← contradicts the PR

But the rest of the same document describes __int/__str/__float (double-underscore), and Phase 10b is listed as the rename that "will" do it — yet this PR already does it. The phase descriptions need to be reconciled to accurately reflect what PR #684 delivers vs what remains.


🟡 Medium

5. Cross-batch type instability for the same SQL query

When the same field appears in both a clean batch and a conflict batch across consecutive calls to execute():

  • Clean batch: status: Int64WHERE status > 400 works as a numeric comparison
  • Conflict batch: synthesized status: Utf8WHERE status > 400 compares a string to an integer (different semantics or error in DataFusion)

This is partially documented ("SQL only synthesizes a bare column for SQL resolution") but there's no guard or test for this cross-batch scenario. Users relying on WHERE status > 400 will silently get wrong results when a conflict batch arrives. Worth adding a test in scanner_datafusion_boundary.rs to document the required behavior (int(status) > 400 idiom).

6. Missing proptest coverage for multi-field varying-schema conflict batches

Per the stored guideline (ID d0d5d16b), proptest coverage is required for multi-field varying-schema batches through the builders. The new conflict detection logic in finish_batch() is tested only with curated unit tests. Adding a proptest that generates batches with random field names and randomly mixed types would close this gap, especially to guard against the reserve_name duplicate-column-name error path.

7. StorageBuilder str columns are Utf8, StreamingBuilder str columns are Utf8View

This divergence is intentional but has a downstream effect: normalize_conflict_columns calls compute::cast(c, &DataType::Utf8) on the str variant and then downcast_ref::<StringArray>(). This works for both types since Arrow can cast Utf8View → Utf8. However, the str_s variable is then of type StringArray regardless of origin, losing the zero-copy property for StreamingBuilder batches. Since normalize_conflict_columns is only invoked in the SQL transform path (not the storage path), this is acceptable — but it's worth a comment explaining why.


🔵 Low / Nit

8. streaming_builder.rs cross-module constant reference

// streaming_builder.rs:
meta.insert(
    crate::storage_builder::CONFLICT_GROUPS_METADATA_KEY.to_string(),
    ...
);

Since CONFLICT_GROUPS_METADATA_KEY is re-exported from crate root via lib.rs, using crate::CONFLICT_GROUPS_METADATA_KEY (or defining it in a shared constants module) would be cleaner and more symmetric with how StorageBuilder references it.

9. StorageBuilder::finish_batch _raw column is pre-inserted into emitted_names before the field loop

if self.keep_raw && !self.raw_values.is_empty() {
    emitted_names.insert("_raw".to_string());
}

This is correct but the check !self.raw_values.is_empty() is asymmetric with the actual _raw emission later which uses the same condition. If keep_raw = true but raw_values happens to be empty mid-batch (edge case during fuzzing), _raw won't be reserved but also won't be emitted, so there's no real bug. A comment explaining the condition would help.

10. edge_type_conflict proptest uses br#"..."# — the conflict uses s as the field name

In scanner.rs, test_type_conflict checks s__int and s__str, and the edge case edge_type_conflict uses field s. These are fine but slightly confusing alongside tests that use status. Renaming to status throughout would improve readability.


Required Updates per Long-Term Memory

Per stored guideline ID 321c958a:

Update COLUMN_NAMING.md when: type conflict suffix format changes, conflict detection logic changes, duplicate key handling changes, or bare-name-by-default behavior changes

All four conditions apply here. The COLUMN_NAMING.md update is required before merge.


✏️ Learnings added
Learnt from: strawgate
URL: https://github.com/strawgate/memagent/pull/684

Timestamp: 2026-04-02T22:48:43.853Z
Learning: In the strawgate/memagent repository, PR `#684` changed the type-conflict column suffix format from single-underscore (`_int`/`_str`/`_float`) to double-underscore (`__int`/`__str`/`__float`) in StorageBuilder and StreamingBuilder (crates/logfwd-arrow/src/storage_builder.rs and streaming_builder.rs). normalize_conflict_columns (crates/logfwd-transform/src/conflict_schema.rs) discovers conflict groups by scanning column name suffixes rather than reading the logfwd.conflict_groups Arrow schema metadata key, despite the design doc stating it should use the metadata key.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-02T22:35:09.600Z
Learning: Applies to docs/COLUMN_NAMING.md : Type conflict column names use double-underscore suffixes (status__int, status__str); any change to suffix format requires COLUMN_NAMING.md update, schema description update, and check for SQL rewriter references

Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-02T22:35:09.600Z
Learning: Applies to docs/COLUMN_NAMING.md : Update COLUMN_NAMING.md when: type conflict suffix format changes, conflict detection logic changes, duplicate key handling changes, or bare-name-by-default behavior changes

Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-02T22:35:09.600Z
Learning: Applies to logfwd-arrow/src/**/*.rs : Reject incremental writes directly into Arrow builders during scanning; require proptest coverage over multi-field varying-schema batches

Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-02T22:35:09.600Z
Learning: Applies to logfwd-arrow/src/**/*.rs : StorageBuilder and StreamingBuilder use deferred-write pattern: collect (field_index, value) during scanning, bulk-build Arrow columns at finish_batch; do not revert to incremental writes

Learnt from: CR
Repo: strawgate/memagent PR: 0
File: coderabbit-custom-pre-merge-checks-unique-id-file-non-traceable-F7F2B60C-1728-4C9A-8889-4F2235E186CA.txt:0-0
Timestamp: 2026-04-02T22:35:09.600Z
Learning: Applies to **/*.rs : Breaking change in logfwd-core requires: (1) PR description names the break and why, (2) every call site updated in same PR, (3) Kani proof contracts updated, (4) ADR in DESIGN.md if design insight drove change

conflict_schema: use logfwd.conflict_groups metadata as authoritative
discriminator instead of suffix heuristics — prevents false-positive
synthesis on user fields literally named foo__int / foo__str

json_extract: Str mode coalesces all conflict variants row-by-row so
json(_raw,'status') returns "200" not null when status is int in that row

scanner_conformance: tighten Int64 fallback tolerance from < 1.0 to exact
equality; fix it uses col.unwrap_or_else (already done); fix loose test

scanner_datafusion_boundary: stamp logfwd.conflict_groups metadata on
test conflict batches to match real builder output

docs: mark AnalyzerRule/TableProvider section as planned (#625), not
current; fix C3 claim in column-type-constraints.md; refresh phase
breakdown in type-suffix-redesign.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
strawgate and others added 2 commits April 2, 2026 18:30
Explains that C3 is satisfied via the planned TableProvider/AnalyzerRule
approach: referenced columns are advertised as Utf8 with cast rewrites to
typed backing columns, only for columns the query actually uses.
normalize_conflict_columns is a batch-level approximation; #625 replaces it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@strawgate strawgate merged commit 8486bf8 into master Apr 2, 2026
7 of 8 checks passed
@strawgate strawgate deleted the feat/suffix-on-conflict-445 branch April 2, 2026 23:39
strawgate added a commit that referenced this pull request Apr 2, 2026
- docs/COLUMN_NAMING.md: update to double-underscore suffixes (__int,
  __str, __float), document logfwd.conflict_groups metadata key, bare
  synthesized column for SQL, int()/float() UDF idiom, and cross-batch
  type instability warning
- scanner_datafusion_boundary.rs: add cross_batch_int_udf_works test
  documenting that int(status) works on both clean and conflict batches
- conflict_schema.rs: explain why Utf8 cast intentionally loses zero-copy
  for StreamingBuilder str columns (SQL transform path only)
- storage_builder.rs: comment on _raw pre-reservation condition
- streaming_builder.rs: use crate::CONFLICT_GROUPS_METADATA_KEY instead
  of crate::storage_builder::CONFLICT_GROUPS_METADATA_KEY (already
  re-exported at crate root)
- scanner.rs: rename test field s → status in test_type_conflict

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
strawgate added a commit that referenced this pull request Apr 2, 2026
- docs/COLUMN_NAMING.md: update to double-underscore suffixes (__int,
  __str, __float), document logfwd.conflict_groups metadata key, bare
  synthesized column for SQL, int()/float() UDF idiom, and cross-batch
  type instability warning
- scanner_datafusion_boundary.rs: add cross_batch_int_udf_works test
  documenting that int(status) works on both clean and conflict batches
- conflict_schema.rs: explain why Utf8 cast intentionally loses zero-copy
  for StreamingBuilder str columns (SQL transform path only)
- storage_builder.rs: comment on _raw pre-reservation condition
- streaming_builder.rs: use crate::CONFLICT_GROUPS_METADATA_KEY instead
  of crate::storage_builder::CONFLICT_GROUPS_METADATA_KEY (already
  re-exported at crate root)
- scanner.rs: rename test field s → status in test_type_conflict

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
strawgate added a commit that referenced this pull request Apr 3, 2026
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
strawgate added a commit that referenced this pull request Apr 3, 2026
Since the type-suffix redesign (#684), columns are only suffixed on
type conflict. The bench SQL referenced timestamp_str, level_str, etc.
which no longer exist, causing every transform to fail silently and
produce 0 output lines with 7.2 GB RSS from unbounded buffering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
strawgate added a commit that referenced this pull request Apr 5, 2026
- Column naming convention section: update to describe actual behavior.
  Single-type fields use the base name (no suffix). Conflicting fields
  become a Struct column under the base name. Legacy _str/_int suffixes
  are not emitted. The old table described the pre-#684 behavior.
- kubernetes.md: /metrics returns 410 Gone (not "does not expose").

Special columns table (_raw_str, _file_str etc.) left unchanged pending
verification of actual emitted column names in code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Column type design: suffix only on conflict, bare names by default

1 participant