feat: suffix column names only on type conflict, delete dead rewriter (#445) #684
Warning
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Out of Scope Changes check | ❓ Inconclusive | Most changes align with #445 objectives, but the CI modification (RUSTC_WRAPPER environment variable) and documentation updates to unimplemented features (normalize_conflict_columns, ConflictGroups output abstraction) may exceed scope. | Clarify whether CI environment changes are required for this PR or belong in a separate change, and confirm if documentation for Phase 10b/10c unimplemented features should land before or after implementation. |
✅ Passed checks (1 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | PR implements all core requirements from #445: bare column names for single-type fields, suffixed columns (__int/__str/__float) only on conflicts, removes dead rewriter code (775 lines), and updates all downstream consumers. |
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
crates/logfwd-arrow/src/scanner.rs (1)
193-213: ⚠️ Potential issue | 🟡 Minor
Assert the legacy suffixed names are absent in the single-type path.
These assertions only prove that `host` and `status` exist. If the scanner accidentally emitted both `host` and `host_str`/`status_int`, this test would still pass and miss the main invariant of the redesign.
💡 Tighten the regression
```diff
         assert_eq!(
             batch
                 .column_by_name("status")
                 .unwrap()
                 .as_any()
                 .downcast_ref::<Int64Array>()
                 .unwrap()
                 .value(1),
             404
         );
+        assert!(
+            batch.column_by_name("host_str").is_none(),
+            "single-type string fields should not emit legacy suffixed columns"
+        );
+        assert!(
+            batch.column_by_name("status_int").is_none(),
+            "single-type int fields should not emit legacy suffixed columns"
+        );
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-arrow/src/scanner.rs` around lines 193 - 213, The test currently only checks that "host" and "status" columns exist but not that legacy suffixed names were omitted; update the single-type fields assertions to also assert absence of the legacy suffixed names by checking batch.column_by_name("host_str").is_none() and batch.column_by_name("status_int").is_none() (or equivalent asserts), so the test fails if the scanner emits both the new names and the old suffixed variants; apply this change in the single-type fields block around the existing asserts that reference batch and column_by_name.
crates/logfwd-arrow/src/streaming_builder.rs (1)
271-324: ⚠️ Potential issue | 🔴 Critical
Prevent emitted-name collisions before building the schema.
This naming rule can now generate the same output name for different fields: e.g. a mixed-type `status` emits `status_int`, while a separate single-type field literally named `status_int` also emits bare `status_int`. The same problem exists for a user field named `_raw` when `keep_raw` later adds the reserved `_raw` column. That yields an ambiguous batch where one array is effectively hidden by name.
💡 Fail fast on duplicate output names
```diff
+        let mut emitted_names = std::collections::HashSet::new();
+        if self.keep_raw && !self.raw_views.is_empty() {
+            emitted_names.insert("_raw".to_string());
+        }
+
+        let mut reserve_name = |name: &str| -> Result<(), ArrowError> {
+            if emitted_names.insert(name.to_string()) {
+                Ok(())
+            } else {
+                Err(ArrowError::InvalidArgumentError(format!(
+                    "duplicate output column name: {name}"
+                )))
+            }
+        };
+
         for fc in &self.fields {
             // Field names come from JSON keys (valid UTF-8 in well-formed input).
             // Use from_utf8_lossy so that fuzz inputs with arbitrary bytes are
             // handled gracefully instead of triggering undefined behaviour.
             let name = String::from_utf8_lossy(&fc.name);
@@
             if fc.has_int {
                 let col_name = if conflict {
                     format!("{}_int", name)
                 } else {
                     name.to_string()
                 };
+                reserve_name(&col_name)?;
                 let mut values = vec![0i64; num_rows];
                 let mut valid = vec![false; num_rows];
                 ...
```
Apply the same `reserve_name(&col_name)?` check in the float/string branches as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-arrow/src/streaming_builder.rs` around lines 271 - 324, The schema-building can emit duplicate output names; ensure you call reserve_name(&col_name)? before adding any column in the Float and Str branches (i.e. inside the fc.has_float and fc.has_str blocks, just before schema_fields.push and arrays.push) the same way the Int branch does, so duplicate output names (including collisions with the reserved "_raw" when keep_raw is used) return an error instead of silently hiding arrays; use the existing reserve_name function and the local col_name variable in those branches.
crates/logfwd-transform/src/udf/json_extract.rs (1)
242-257: ⚠️ Potential issue | 🟡 Minor
Add a regression for quoted numeric strings on the bare-name path.
The current suite still only covers conflict batches, so this branch never runs against a single-type string column like `{"status":"200"}` or `{"duration":"1.5"}`. That leaves the PR’s new “quoted numbers return NULL” behavior unpinned.
💡 Suggested regression cases
```diff
+    #[tokio::test]
+    async fn test_json_int_quoted_number_returns_null() {
+        let batch = make_raw_batch(vec![r#"{"status": "200"}"#]);
+        let result = query("SELECT json_int(_raw, 'status') as s FROM logs", batch).await;
+        let col = result
+            .column(0)
+            .as_any()
+            .downcast_ref::<Int64Array>()
+            .unwrap();
+        assert!(col.is_null(0));
+    }
+
+    #[tokio::test]
+    async fn test_json_float_quoted_number_returns_null() {
+        let batch = make_raw_batch(vec![r#"{"duration": "1.5"}"#]);
+        let result = query("SELECT json_float(_raw, 'duration') as d FROM logs", batch).await;
+        let col = result
+            .column(0)
+            .as_any()
+            .downcast_ref::<Float64Array>()
+            .unwrap();
+        assert!(col.is_null(0));
+    }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-transform/src/udf/json_extract.rs` around lines 242 - 257, Add regression tests that exercise the bare-name fallback path when the column is a single-type string so the branches in JsonExtractMode::Int and JsonExtractMode::Float run; specifically, create inputs like a string column containing JSON rows {"status":"200"} and {"duration":"1.5"} and assert that json_extract returns nulls (i.e. matches the code paths that produce arrow::array::new_null_array(&DataType::Int64, ...) and new_null_array(&DataType::Float64, ...)). Target the UDF/test that invokes the JsonExtract logic (so the arr variable, DataType checks, and cast logic for DataType::Int64/Float64 are covered) to pin the regression for quoted numeric strings on the bare-name path.
crates/logfwd-core/tests/scanner_conformance.rs (2)
95-148: ⚠️ Potential issue | 🟠 Major
Make the suffixed→bare fallback fail on type mismatches.
Once `col_name` is chosen, these branches should fail if that column has the wrong array type. Right now they silently skip assertions when a type mismatch occurs, which can hide typing regressions and turn them into false-green conformance runs.
💡 Make the fallback strict
```diff
         let col_name = if batch.column_by_name(&suffixed).is_some() {
             suffixed
         } else {
             key_str.to_string()
         };
-        if let Some(col) = batch.column_by_name(&col_name)
-            && let Some(arr) = col.as_any().downcast_ref::<StringArray>()
-            && !arr.is_null(row)
-        {
+        let col = batch
+            .column_by_name(&col_name)
+            .unwrap_or_else(|| panic!("missing expected column '{col_name}'"));
+        let arr = col
+            .as_any()
+            .downcast_ref::<StringArray>()
+            .unwrap_or_else(|| {
+                panic!("column '{col_name}' has wrong type: {:?}", col.data_type())
+            });
+        if !arr.is_null(row) {
             let actual = arr.value(row);
             ...
         }
```
Apply the same pattern to the `Int64Array` and `Float64Array` branches as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-core/tests/scanner_conformance.rs` around lines 95 - 148, The branches pick col_name by falling back from suffixed to bare but currently silently skip when the existing column has the wrong array type; change each branch (StringArray, Int64Array, Float64Array handling around batch.column_by_name, arr.as_any().downcast_ref and col_name) to first fetch column by name and then assert/fail if the downcast to the expected array type returns None (e.g., assert!(col.as_any().downcast_ref::<StringArray>().is_some(), "type mismatch for {col_name} at row {row}: expected StringArray"), instead of silently skipping), and apply the same strict check to the Int64Array and Float64Array branches so type mismatches surface as test failures.
318-330: ⚠️ Potential issue | 🟡 Minor
Assert both `s_col` and `st_col` are string-like before casting to Utf8.
The Int64Array and Float64Array branches use `if let Some()` on s_col then `unwrap()` on st_col, establishing type agreement. The string branch only checks s_col's type before casting both columns to Utf8, allowing numeric st_col values to be silently stringified and hide type mismatches. Add an assertion matching the s_col check to catch builder inconsistencies.
Suggested fix
```diff
         if matches!(
             s_col.data_type(),
             arrow::datatypes::DataType::Utf8
                 | arrow::datatypes::DataType::Utf8View
                 | arrow::datatypes::DataType::LargeUtf8
         ) {
+            assert!(
+                matches!(
+                    st_col.data_type(),
+                    arrow::datatypes::DataType::Utf8
+                        | arrow::datatypes::DataType::Utf8View
+                        | arrow::datatypes::DataType::LargeUtf8
+                ),
+                "builder type mismatch at {col_name}: left={:?}, right={:?}",
+                s_col.data_type(),
+                st_col.data_type(),
+            );
             let s_val = arrow::compute::cast(s_col, &arrow::datatypes::DataType::Utf8).unwrap();
             let st_val = arrow::compute::cast(st_col, &arrow::datatypes::DataType::Utf8).unwrap();
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-core/tests/scanner_conformance.rs` around lines 318 - 330, The string branch currently only checks s_col.data_type() before casting both s_col and st_col to Utf8, which can mask type mismatches; update the check to assert that st_col is string-like as well (mirror the pattern used in the Int64Array/Float64Array branches): perform a type check or an if-let downcast on s_col (e.g., downcast_ref::<StringArray>()) and likewise assert or use if-let on st_col (st_col.as_any().downcast_ref::<StringArray>()) before calling arrow::compute::cast or unwrap, so both s_col and st_col are confirmed Utf8/string arrays prior to casting.
crates/logfwd/tests/integration.rs (1)
483-507: ⚠️ Potential issue | 🟡 Minor
Stale comments still describe the old suffix-based schema.
The code now uses bare names, but comments still reference `{field}_{type}` and `team_str`, which is misleading during test maintenance/debugging.
Proposed comment-only cleanup
```diff
-    // CSV columns use plain names (no `_str` suffix); scanner columns use the
-    // `{field}_{type}` convention. The alias brings the enriched column into
-    // the logfwd naming scheme for downstream compatibility.
+    // Both scanner and CSV columns are addressed via bare names for this
+    // single-type dataset.
@@
-    // The output must contain the `team_str` column from the CSV.
+    // The output must contain the `team` column from the CSV.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd/tests/integration.rs` around lines 483 - 507, Update the stale comments in the SqlTransform test: remove references to the old "{field}_{type}" convention and "team_str" and instead state that CSV columns use bare names (e.g. "team") and the SQL alias brings that field into the enriched output; adjust the comment above the SQL string and the assertion comment that references schema.field_with_name("team")/result.schema() to reflect the current bare-name behavior and avoid mentioning suffix-based naming or legacy examples.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-core/tests/compliance_data.rs`:
- Around line 702-703: The test uses misleading expect messages that reference
removed suffixed names; update the expect messages on the get_str/get_* calls
(e.g., the get_str(batch, "items", 0).expect("items_str") and similar get_*
calls like the one for "v_float") to reflect the actual bare lookup keys (e.g.,
expect("items") and expect("v_float")), and make the same change for the other
occurrence at the second location mentioned so the failure text matches the
queried key names; locate calls to get_str/get_* and replace the suffixed
strings in their expect(...) arguments accordingly.
In `@dev-docs/research/type-suffix-redesign.md`:
- Around line 61-73: Update the "AnalyzerRule + TableProvider for type-conflict
batches" section to clearly mark it as a planned design for issue `#625` and add
one sentence describing the current implementation: that code currently
registers a plain MemTable for tables without an AnalyzerRule/TableProvider, and
that conflict batches are not yet routed via AnalyzerRule/TableProvider in the
present implementation (see use of MemTable in the transform code). Use the
exact headings/terms "AnalyzerRule", "TableProvider", "MemTable", and reference
"#625" so readers know this is a future change and how the current behavior
differs.
---
Outside diff comments:
In `@crates/logfwd-arrow/src/scanner.rs`:
- Around line 193-213: The test currently only checks that "host" and "status"
columns exist but not that legacy suffixed names were omitted; update the
single-type fields assertions to also assert absence of the legacy suffixed
names by checking batch.column_by_name("host_str").is_none() and
batch.column_by_name("status_int").is_none() (or equivalent asserts), so the
test fails if the scanner emits both the new names and the old suffixed
variants; apply this change in the single-type fields block around the existing
asserts that reference batch and column_by_name.
In `@crates/logfwd-arrow/src/streaming_builder.rs`:
- Around line 271-324: The schema-building can emit duplicate output names;
ensure you call reserve_name(&col_name)? before adding any column in the Float
and Str branches (i.e. inside the fc.has_float and fc.has_str blocks, just
before schema_fields.push and arrays.push) the same way the Int branch does, so
duplicate output names (including collisions with the reserved "_raw" when
keep_raw is used) return an error instead of silently hiding arrays; use the
existing reserve_name function and the local col_name variable in those
branches.
In `@crates/logfwd-core/tests/scanner_conformance.rs`:
- Around line 95-148: The branches pick col_name by falling back from suffixed
to bare but currently silently skip when the existing column has the wrong array
type; change each branch (StringArray, Int64Array, Float64Array handling around
batch.column_by_name, arr.as_any().downcast_ref and col_name) to first fetch
column by name and then assert/fail if the downcast to the expected array type
returns None (e.g.,
assert!(col.as_any().downcast_ref::<StringArray>().is_some(), "type mismatch for
{col_name} at row {row}: expected StringArray"), instead of silently skipping),
and apply the same strict check to the Int64Array and Float64Array branches so
type mismatches surface as test failures.
- Around line 318-330: The string branch currently only checks s_col.data_type()
before casting both s_col and st_col to Utf8, which can mask type mismatches;
update the check to assert that st_col is string-like as well (mirror the
pattern used in the Int64Array/Float64Array branches): perform a type check or
an if-let downcast on s_col (e.g., downcast_ref::<StringArray>()) and likewise
assert or use if-let on st_col (st_col.as_any().downcast_ref::<StringArray>())
before calling arrow::compute::cast or unwrap, so both s_col and st_col are
confirmed Utf8/string arrays prior to casting.
In `@crates/logfwd-transform/src/udf/json_extract.rs`:
- Around line 242-257: Add regression tests that exercise the bare-name fallback
path when the column is a single-type string so the branches in
JsonExtractMode::Int and JsonExtractMode::Float run; specifically, create inputs
like a string column containing JSON rows {"status":"200"} and
{"duration":"1.5"} and assert that json_extract returns nulls (i.e. matches the
code paths that produce arrow::array::new_null_array(&DataType::Int64, ...) and
new_null_array(&DataType::Float64, ...)). Target the UDF/test that invokes the
JsonExtract logic (so the arr variable, DataType checks, and cast logic for
DataType::Int64/Float64 are covered) to pin the regression for quoted numeric
strings on the bare-name path.
In `@crates/logfwd/tests/integration.rs`:
- Around line 483-507: Update the stale comments in the SqlTransform test:
remove references to the old "{field}_{type}" convention and "team_str" and
instead state that CSV columns use bare names (e.g. "team") and the SQL alias
brings that field into the enriched output; adjust the comment above the SQL
string and the assertion comment that references
schema.field_with_name("team")/result.schema() to reflect the current bare-name
behavior and avoid mentioning suffix-based naming or legacy examples.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6c66182f-15c2-4a97-982e-30884a6cec9b
📒 Files selected for processing (15)
- .github/workflows/ci.yml
- crates/logfwd-arrow/src/scanner.rs
- crates/logfwd-arrow/src/storage_builder.rs
- crates/logfwd-arrow/src/streaming_builder.rs
- crates/logfwd-core/tests/compliance_data.rs
- crates/logfwd-core/tests/scanner_conformance.rs
- crates/logfwd-transform/src/lib.rs
- crates/logfwd-transform/src/rewriter.rs
- crates/logfwd-transform/src/udf/json_extract.rs
- crates/logfwd-transform/tests/raw_first_bench.rs
- crates/logfwd-transform/tests/scanner_datafusion_boundary.rs
- crates/logfwd/src/pipeline.rs
- crates/logfwd/tests/compliance.rs
- crates/logfwd/tests/integration.rs
- dev-docs/research/type-suffix-redesign.md
💤 Files with no reviewable changes (2)
- crates/logfwd-transform/src/lib.rs
- crates/logfwd-transform/src/rewriter.rs

- json_extract: add regression tests for json_int/json_float returning null when the field is a quoted string (bare Utf8 column, no conflict)
- scanner_conformance oracle: panic on type mismatch instead of silently skipping assertions; fall back bare→suffixed column lookup so single-type fields are actually verified
- scanner_conformance oracle: accept Int64 column when checking a float value (e.g. -0 has no decimal/exponent so scanner emits Int64)
- assert_builders_consistent: assert st_col is also string-typed before casting to Utf8 for value comparison

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/logfwd-transform/tests/scanner_datafusion_boundary.rs (1)
570-590: 🧹 Nitpick | 🔵 Trivial
Assert the full ordered result here.
This now only checks the total row count and the `ERROR` bucket, so it would still pass if `ORDER BY cnt DESC, level ASC` were wrong for `INFO` or `DEBUG`. Please assert the full row order for this path.
✅ Stronger assertion
```diff
-    // ERROR×2 comes first (cnt DESC), then DEBUG×1 and INFO×2 tie — but INFO
-    // and ERROR both have 2, and ERROR sorts before INFO alphabetically when
-    // cnt is equal. Let's just verify the total count.
-    let total: i64 = counts.iter().map(|v| v.unwrap_or(0)).sum();
-    assert_eq!(total, 5);
-    // ERROR must appear with count 2.
-    let error_pos = levels.iter().position(|v| v == "ERROR").unwrap();
-    assert_eq!(counts[error_pos], Some(2));
+    assert_eq!(levels, ["ERROR", "INFO", "DEBUG"]);
+    assert_eq!(counts, [Some(2), Some(2), Some(1)]);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-transform/tests/scanner_datafusion_boundary.rs` around lines 570 - 590, The test currently only checks total and ERROR bucket; instead assert the full ordered result returned by SqlTransform::new(...).execute_blocking(batch) by comparing the collected levels and counts (from collect_string_col and collect_i64_col) to the expected ordered vectors: levels should equal ["ERROR","INFO","DEBUG"] and counts should equal [Some(2), Some(2), Some(1)] (or the appropriate Option<i64> form used in the test); update the assertions to compare these exact vectors so ORDER BY cnt DESC, level ASC is fully validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-transform/src/conflict_schema.rs`:
- Around line 64-77: The grouping logic using strip_conflict_suffix and groups
currently treats any shared suffixes as a conflict and may synthesize a bare
column for literal sibling keys; change the discriminator so groups only survive
if (a) the suffixes are in a vetted set of known conflict suffixes (e.g. a
KNOWN_CONFLICT_SUFFIXES list used by strip_conflict_suffix) AND (b) the member
fields in schema.fields() show conflicting representations (e.g. differing data
types or explicit scanner-emitted conflict metadata on the Field/Schema) —
otherwise drop the group; update the groups.retain call to perform these checks
(and reference existing_bare, strip_conflict_suffix, and the per-field
type/metadata) and add a regression test exercising literal siblings vs real
conflict variants.
- Around line 83-108: The appended synthesized columns come from `groups` (a
HashMap) and are added in arbitrary order causing schema flapping; convert
`groups` into a deterministic Vec sorted by the minimum source-column index of
each group's `members` before the for-loop that builds
`extra_fields`/`extra_arrays` (the code that calls `merge_to_utf8`, pushes to
`extra_fields` and `extra_arrays`, and uses `batch.num_rows()`); compute the min
index for each `(base, members)` entry, sort by that min, then iterate the
sorted vector so the order of pushed fields/arrays (and thus the resulting
`new_schema`) is stable across runs.
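The first item above (vetted suffixes plus differing member types) can be sketched without Arrow. This is only an illustration under stated assumptions: `ColType` is a hypothetical stand-in for the Arrow data type, `conflict_bases` is a hypothetical helper, and the suffix list `_int`/`_float`/`_str` is taken from the suffixes named elsewhere in this review. The real `strip_conflict_suffix` and `groups.retain` logic may differ.

```rust
use std::collections::HashMap;

/// Suffixes assumed to be emitted by the scanner on a type conflict.
const KNOWN_CONFLICT_SUFFIXES: [&str; 3] = ["_int", "_float", "_str"];

/// Hypothetical stand-in for an Arrow data type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColType {
    Int64,
    Float64,
    Utf8,
}

/// Split `name` into (base, suffix) if it ends in a known conflict suffix.
/// A name that consists only of the suffix has no base and is rejected.
fn strip_conflict_suffix(name: &str) -> Option<(&str, &'static str)> {
    KNOWN_CONFLICT_SUFFIXES.iter().find_map(|suffix| {
        name.strip_suffix(*suffix)
            .filter(|base| !base.is_empty())
            .map(|base| (base, *suffix))
    })
}

/// Return the base names that look like real conflict groups:
/// no existing bare column, at least two members, and differing types.
fn conflict_bases(columns: &[(&str, ColType)]) -> Vec<String> {
    let mut groups: HashMap<&str, Vec<ColType>> = HashMap::new();
    for &(name, ty) in columns {
        if let Some((base, _)) = strip_conflict_suffix(name) {
            groups.entry(base).or_default().push(ty);
        }
    }
    let mut bases: Vec<String> = groups
        .into_iter()
        .filter(|(base, types)| {
            let bare_exists = columns.iter().any(|(n, _)| n == base);
            let types_differ = types.iter().any(|t| *t != types[0]);
            !bare_exists && types.len() >= 2 && types_differ
        })
        .map(|(base, _)| base.to_string())
        .collect();
    // Sort so the result does not depend on HashMap iteration order.
    bases.sort();
    bases
}
```

With this rule, literal sibling keys such as `a_int` and `a_str` that both hold strings are left alone, while `status_int: Int64` next to `status_str: Utf8` is treated as a conflict group.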
In `@crates/logfwd-transform/src/lib.rs`:
- Around line 598-603: The code currently calls
conflict_schema::normalize_conflict_columns(batch) which mutates the physical
RecordBatch (the `batch` used to construct the MemTable), causing synthetic bare
columns to become part of `logs` and breaking SELECT * / wildcard round-trips;
instead stop normalizing the RecordBatch before MemTable registration and move
the bare-name aliasing into the query planning/projection stage (so the MemTable
stores the original physical schema). Concretely: revert removal of columns from
the RecordBatch (undo the call to conflict_schema::normalize_conflict_columns
when building the MemTable), keep registering the original `batch` in the
MemTable, and implement the synthetic `status: Utf8` aliasing inside the
planner/projection codepath that prepares projections for execution (referencing
the same conflict_schema logic but applied to projection expressions rather than
mutating `batch`).
In `@crates/logfwd-transform/src/udf/json_extract.rs`:
- Around line 59-68: The suffix_order() mapping is fine but the lookup logic
that uses it (e.g., in the json(...) and json_float(...) extraction paths) must
not pick the first existing column at batch-level; instead implement row-wise
coalescing across the ordered suffix variants: for each field, generate an
expression that checks each variant column in suffix_order(self) in preference
order and returns the first non-null/valid per row (e.g., COALESCE or
conditional pick with IS NOT NULL and appropriate casting/parsing for
_int/_float/_str), rather than selecting a single column for the whole batch;
update the codepaths that currently short-circuit on the first matching column
(referencing functions json, json_float, and the suffix_order method) to build
per-row coalesce logic so mixed-type batches yield per-row values correctly.
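The row-wise coalescing requested for `json_extract.rs` can be illustrated with plain vectors. This is a minimal sketch, not the UDF's implementation: `Column`, `Value`, and `coalesce_rows` are hypothetical names, nullable `Vec`s stand in for Arrow arrays, and the caller is assumed to pass the variants in the preference order that `suffix_order()` would produce.

```rust
/// One row's value after coalescing the conflict variants.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Float(f64),
    Str(String),
}

/// Hypothetical nullable column, standing in for an Arrow array.
enum Column {
    Int(Vec<Option<i64>>),
    Float(Vec<Option<f64>>),
    Str(Vec<Option<String>>),
}

impl Column {
    /// Return the value at `row`, or None if the row is null or out of range.
    fn get(&self, row: usize) -> Option<Value> {
        match self {
            Column::Int(v) => v.get(row).copied().flatten().map(Value::Int),
            Column::Float(v) => v.get(row).copied().flatten().map(Value::Float),
            Column::Str(v) => v.get(row).cloned().flatten().map(Value::Str),
        }
    }
}

/// For each row, pick the first non-null value among `variants`,
/// which are given in preference order.
fn coalesce_rows(variants: &[&Column], num_rows: usize) -> Vec<Option<Value>> {
    (0..num_rows)
        .map(|row| variants.iter().find_map(|col| col.get(row)))
        .collect()
}
```

The difference from batch-level selection is that a row where `status_int` is null can still yield its `status_str` value, instead of the whole batch being read from whichever variant column happened to exist first.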
---
Outside diff comments:
In `@crates/logfwd-transform/tests/scanner_datafusion_boundary.rs`:
- Around line 570-590: The test currently only checks total and ERROR bucket;
instead assert the full ordered result returned by
SqlTransform::new(...).execute_blocking(batch) by comparing the collected levels
and counts (from collect_string_col and collect_i64_col) to the expected ordered
vectors: levels should equal ["ERROR","INFO","DEBUG"] and counts should equal
[Some(2), Some(2), Some(1)] (or the appropriate Option<i64> form used in the
test); update the assertions to compare these exact vectors so ORDER BY cnt
DESC, level ASC is fully validated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 01e14670-2d66-46b1-b4d9-2594531f03b8
📒 Files selected for processing (5)
- crates/logfwd-core/src/framer.rs
- crates/logfwd-transform/src/conflict_schema.rs
- crates/logfwd-transform/src/lib.rs
- crates/logfwd-transform/src/udf/json_extract.rs
- crates/logfwd-transform/tests/scanner_datafusion_boundary.rs
```rust
    let mut extra_fields: Vec<Field> = Vec::with_capacity(groups.len());
    let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(groups.len());

    for (base, members) in &groups {
        let int_col = members
            .iter()
            .find(|(s, _)| *s == "_int")
            .map(|(_, idx)| batch.column(*idx).as_ref());
        let float_col = members
            .iter()
            .find(|(s, _)| *s == "_float")
            .map(|(_, idx)| batch.column(*idx).as_ref());
        let str_col = members
            .iter()
            .find(|(s, _)| *s == "_str")
            .map(|(_, idx)| batch.column(*idx).as_ref());

        let merged = merge_to_utf8(int_col, float_col, str_col, batch.num_rows());
        extra_fields.push(Field::new(*base, DataType::Utf8, true));
        extra_arrays.push(merged);
    }

    // Append the computed bare columns to the existing schema and arrays.
    let mut fields: Vec<Field> = schema.fields().iter().map(|f| (**f).clone()).collect();
    fields.extend(extra_fields);
    let new_schema = Arc::new(Schema::new(fields));
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's examine the file structure and context
head -120 crates/logfwd-transform/src/conflict_schema.rs | tail -50
```
Repository: strawgate/memagent
Length of output: 1865
🏁 Script executed:
```shell
# Check what `groups` is - need to see the full function context
sed -n '50,120p' crates/logfwd-transform/src/conflict_schema.rs
```
Repository: strawgate/memagent
Length of output: 3007
🏁 Script executed:
```shell
# Check the coding style documentation for error handling
cat dev-docs/CODE_STYLE.md | grep -A 10 -B 2 "unwrap\|error\|production"
```
Repository: strawgate/memagent
Length of output: 1505
🏁 Script executed:
```shell
# Look for HashMap imports and confirm the type of `groups`
rg "HashMap|groups" crates/logfwd-transform/src/conflict_schema.rs -B 5 -A 2
```
Repository: strawgate/memagent
Length of output: 2402
Stabilize synthesized-column order to prevent schema flapping.
groups is a HashMap, so appended bare columns come out in arbitrary order. With multiple conflict groups, this causes SELECT * and schema-order-sensitive downstream consumers to produce different results between runs. Sort by the minimum source-column index within each group before appending.
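The ordering idea can be tried in isolation. The sketch below assumes a simplified `groups` shape (base name mapped to `(suffix, source column index)` pairs) and a hypothetical helper `ordered_bases`; it only demonstrates that sorting by each group's minimum source index gives an order independent of `HashMap` iteration.

```rust
use std::collections::HashMap;

/// Return group base names ordered by the smallest source-column index
/// found in each group, so the result is the same on every run.
fn ordered_bases(groups: &HashMap<&str, Vec<(&str, usize)>>) -> Vec<String> {
    let mut ordered: Vec<_> = groups.iter().collect();
    ordered.sort_by_key(|(_, members)| {
        members
            .iter()
            .map(|(_, idx)| *idx)
            .min()
            // An empty group sorts last; not expected in practice.
            .unwrap_or(usize::MAX)
    });
    ordered
        .into_iter()
        .map(|(base, _)| base.to_string())
        .collect()
}
```

Because each source column belongs to at most one group, the minimum indices are distinct and the sort key needs no tie-breaker.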
♻️ One way to stabilize the append order
```diff
-    let mut extra_fields: Vec<Field> = Vec::with_capacity(groups.len());
-    let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(groups.len());
-
-    for (base, members) in &groups {
+    let mut ordered_groups: Vec<_> = groups.iter().collect();
+    ordered_groups.sort_by_key(|(_, members)| {
+        members
+            .iter()
+            .map(|(_, idx)| *idx)
+            .min()
+            .unwrap_or(usize::MAX)
+    });
+
+    let mut extra_fields: Vec<Field> = Vec::with_capacity(ordered_groups.len());
+    let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(ordered_groups.len());
+
+    for (base, members) in ordered_groups {
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    let mut extra_fields: Vec<Field> = Vec::with_capacity(groups.len());
-    let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(groups.len());
-    for (base, members) in &groups {
-        let int_col = members
-            .iter()
-            .find(|(s, _)| *s == "_int")
-            .map(|(_, idx)| batch.column(*idx).as_ref());
-        let float_col = members
-            .iter()
-            .find(|(s, _)| *s == "_float")
-            .map(|(_, idx)| batch.column(*idx).as_ref());
-        let str_col = members
-            .iter()
-            .find(|(s, _)| *s == "_str")
-            .map(|(_, idx)| batch.column(*idx).as_ref());
-        let merged = merge_to_utf8(int_col, float_col, str_col, batch.num_rows());
-        extra_fields.push(Field::new(*base, DataType::Utf8, true));
-        extra_arrays.push(merged);
-    }
-    // Append the computed bare columns to the existing schema and arrays.
-    let mut fields: Vec<Field> = schema.fields().iter().map(|f| (**f).clone()).collect();
-    fields.extend(extra_fields);
-    let new_schema = Arc::new(Schema::new(fields));
+    let mut ordered_groups: Vec<_> = groups.iter().collect();
+    ordered_groups.sort_by_key(|(_, members)| {
+        members
+            .iter()
+            .map(|(_, idx)| *idx)
+            .min()
+            .unwrap_or(usize::MAX)
+    });
+    let mut extra_fields: Vec<Field> = Vec::with_capacity(ordered_groups.len());
+    let mut extra_arrays: Vec<Arc<dyn Array>> = Vec::with_capacity(ordered_groups.len());
+    for (base, members) in ordered_groups {
+        let int_col = members
+            .iter()
+            .find(|(s, _)| *s == "_int")
+            .map(|(_, idx)| batch.column(*idx).as_ref());
+        let float_col = members
+            .iter()
+            .find(|(s, _)| *s == "_float")
+            .map(|(_, idx)| batch.column(*idx).as_ref());
+        let str_col = members
+            .iter()
+            .find(|(s, _)| *s == "_str")
+            .map(|(_, idx)| batch.column(*idx).as_ref());
+        let merged = merge_to_utf8(int_col, float_col, str_col, batch.num_rows());
+        extra_fields.push(Field::new(*base, DataType::Utf8, true));
+        extra_arrays.push(merged);
+    }
+    // Append the computed bare columns to the existing schema and arrays.
+    let mut fields: Vec<Field> = schema.fields().iter().map(|f| (**f).clone()).collect();
+    fields.extend(extra_fields);
+    let new_schema = Arc::new(Schema::new(fields));
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/logfwd-transform/src/conflict_schema.rs` around lines 83 - 108, The
appended synthesized columns come from `groups` (a HashMap) and are added in
arbitrary order causing schema flapping; convert `groups` into a deterministic
Vec sorted by the minimum source-column index of each group's `members` before
the for-loop that builds `extra_fields`/`extra_arrays` (the code that calls
`merge_to_utf8`, pushes to `extra_fields` and `extra_arrays`, and uses
`batch.num_rows()`); compute the min index for each `(base, members)` entry,
sort by that min, then iterate the sorted vector so the order of pushed
fields/arrays (and thus the resulting `new_schema`) is stable across runs.
| // | ||
| // Normalize the batch first: if the scanner detected type conflicts it | ||
| // emits suffixed columns (`status_int`, `status_str`). Add a bare | ||
| // `status: Utf8` column so SQL using bare names resolves on both clean | ||
| // and conflict batches. | ||
| let batch = conflict_schema::normalize_conflict_columns(batch); |
This mutates the physical schema seen by SELECT *.
Normalizing before MemTable registration makes the synthetic bare columns part of logs, so conflict batches now expose both the typed variants and the synthetic view column. That breaks round-trip / wildcard semantics; the bare-name aliasing needs to happen in planning or projection, not by altering the input batch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/logfwd-transform/src/lib.rs` around lines 598 - 603, The code
currently calls conflict_schema::normalize_conflict_columns(batch) which mutates
the physical RecordBatch (the `batch` used to construct the MemTable), causing
synthetic bare columns to become part of `logs` and breaking SELECT * / wildcard
round-trips; instead stop normalizing the RecordBatch before MemTable
registration and move the bare-name aliasing into the query planning/projection
stage (so the MemTable stores the original physical schema). Concretely: revert
removal of columns from the RecordBatch (undo the call to
conflict_schema::normalize_conflict_columns when building the MemTable), keep
registering the original `batch` in the MemTable, and implement the synthetic
`status: Utf8` aliasing inside the planner/projection codepath that prepares
projections for execution (referencing the same conflict_schema logic but
applied to projection expressions rather than mutating `batch`).
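A minimal sketch of what projection-time aliasing could look like, assuming the double-underscore suffixes this PR ends up with. The helper `bare_alias_expr` is hypothetical and only builds the expression as text for readability; a real implementation would construct the equivalent logical expression in the planner rather than concatenate SQL.

```rust
/// Hypothetical helper: build a projection expression that exposes a bare
/// column name for a conflict group without mutating the stored batch.
/// `variants` lists the type suffixes present, e.g. ["int", "str"].
fn bare_alias_expr(base: &str, variants: &[&str]) -> String {
    let parts: Vec<String> = variants
        .iter()
        .map(|v| {
            let col = format!("{base}__{v}");
            if *v == "str" {
                // String variants are used as-is.
                col
            } else {
                // Numeric variants are cast so every branch has the same type.
                format!("CAST({col} AS VARCHAR)")
            }
        })
        .collect();
    format!("COALESCE({}) AS {base}", parts.join(", "))
}
```

With this shape the table keeps only the typed variants, so `SELECT *` would return the physical columns while a query that names `status` explicitly gets the synthesized view.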
Actionable comments posted: 4
♻️ Duplicate comments (1)
dev-docs/research/type-suffix-redesign.md (1)
194-200: ⚠️ Potential issue | 🟠 Major
Mark schema padding as planned, not current behavior.
The PR objectives for #684 explicitly leave schema padding to #625, but this section reads as if missing SQL-referenced columns are already padded on every batch. Please move this to a planned/future note and separate it from what the current PR actually implements.
📝 Suggested wording
-At config time, `QueryAnalyzer` extracts `referenced_columns` from the
-user's SQL. Before registering each batch as a MemTable, pad with null
-columns for any SQL-referenced columns missing from the batch.
+Planned (`#625`): at config time, `QueryAnalyzer` will extract
+`referenced_columns` from the user's SQL and pad missing
+SQL-referenced columns with nulls.
+
+Current PR scope stops earlier: conflict batches are normalized for
+bare-name resolution, but cross-batch padding is still future work.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@dev-docs/research/type-suffix-redesign.md` around lines 194 - 200, The doc currently states that QueryAnalyzer extracts referenced_columns and that batches are padded with nulls and normalize_conflict_columns synthesizes columns; update this text to clearly mark schema padding as planned work (to be handled by issue `#625`) rather than current behavior: change language around QueryAnalyzer, MemTable batch padding, and normalize_conflict_columns to indicate these are future/planned features, separate them from the implemented behavior for conflict batches, and add a short note linking the planned behavior to the corresponding issue number so readers understand it is not yet implemented.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-core/tests/scanner_conformance.rs`:
- Around line 99-102: The test currently silently skips validation when neither
the suffixed nor bare column exists because col is set via .or_else(||
batch.column_by_name(...)) and then matched with if let Some(col), which masks
missing columns; change this to eagerly fail by replacing that lookup with
expect or unwrap_or_else (e.g., let col =
batch.column_by_name(&format!("{key_str}__str")).or_else(||
batch.column_by_name(key_str)).expect(&format!("missing column for key: {}",
key_str));) so a missing __str/__int/__float or bare column panics the test
immediately; apply the same change for the other two occurrences that resolve
columns for __int and __float as well.
- Around line 164-171: The Int64 fallback in the test is using a too-large
tolerance `(actual - expected).abs() < 1.0`; tighten this by asserting exact
equality for integer-derived floats or a tiny epsilon instead. Update the
assertion in the Int64Array branch (where `actual = arr.value(row) as f64` and
`expected` are compared) to either use `actual == expected` for semantically
integral spellings or `(actual - expected).abs() < EPSILON` with a small epsilon
(e.g., 1e-12 or f64::EPSILON) so that values like `1.9` won't be mistaken for
`1`. Ensure the assertion message remains helpful and references `key_str` and
`row`.
In `@dev-docs/research/column-type-constraints.md`:
- Around line 112-117: The current wording in "Config declares type hints;
reader infers by default." incorrectly asserts the default path satisfies C3;
update the text to state the reader infers types by default and that this
default behavior satisfies C1 (per-batch correctness) only, and explicitly note
that C3 (cross-batch schema stability) and C8 are guaranteed only when a field
is pinned via config (e.g., schema: { status: int }) or after implementing
additional schema-stability work; keep the example of schema pinning and remove
the claim that C3 is satisfied by the default data-driven path.
In `@dev-docs/research/type-suffix-redesign.md`:
- Around line 204-223: Update the Phase 10/10b/10c roadmap text to reflect PR
`#684` as implemented: mark Phase 10 as complete (change “to be updated” to done),
remove or move Phase 10b from “future work” to completed changes noting the
double-underscore rename in StreamingBuilder and StorageBuilder and the updates
to strip_conflict_suffix (conflict_schema.rs) and suffix_order
(json_extract.rs), and state that logfwd.conflict_groups schema metadata
stamping was added; also update Phase 10c to declare ConflictGroups and
TypedValue added to logfwd-output and note that OTLP, JSON Lines/TCP/UDP, and
stdout sinks now preserve types per-row and that tests for conflict batch
round-trips were added.
---
Duplicate comments:
In `@dev-docs/research/type-suffix-redesign.md`:
- Around line 194-200: The doc currently states that QueryAnalyzer extracts
referenced_columns and that batches are padded with nulls and
normalize_conflict_columns synthesizes columns; update this text to clearly mark
schema padding as planned work (to be handled by issue `#625`) rather than current
behavior: change language around QueryAnalyzer, MemTable batch padding, and
normalize_conflict_columns to indicate these are future/planned features,
separate them from the implemented behavior for conflict batches, and add a
short note linking the planned behavior to the corresponding issue number so
readers understand it is not yet implemented.
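The tolerance finding for `scanner_conformance.rs` is easy to see with concrete numbers. The sketch below uses two hypothetical helpers, `loose_match` and `exact_match`, that mirror the two comparison styles; they are illustrations and not the test's actual code.

```rust
/// Loose check in the style of the original assertion:
/// any difference below 1.0 is accepted.
fn loose_match(actual: f64, expected: f64) -> bool {
    (actual - expected).abs() < 1.0
}

/// Tightened check: an Int64 cell converted to f64 must equal the
/// expected value exactly.
fn exact_match(actual: i64, expected: f64) -> bool {
    actual as f64 == expected
}
```

An Int64 cell holding `1` passes the loose check against an expected `1.9`, which is the false positive the review describes. Exact equality rejects it while still accepting `-0`, since `0.0 == -0.0` holds for IEEE 754 floats.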
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ead96112-98f0-46db-b525-6cc7be49e901
📒 Files selected for processing (11)
crates/logfwd-arrow/src/lib.rs
crates/logfwd-arrow/src/scanner.rs
crates/logfwd-arrow/src/storage_builder.rs
crates/logfwd-arrow/src/streaming_builder.rs
crates/logfwd-core/tests/compliance_data.rs
crates/logfwd-core/tests/scanner_conformance.rs
crates/logfwd-transform/src/conflict_schema.rs
crates/logfwd-transform/src/udf/json_extract.rs
crates/logfwd-transform/tests/scanner_datafusion_boundary.rs
dev-docs/research/column-type-constraints.md
dev-docs/research/type-suffix-redesign.md
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/logfwd-transform/src/lib.rs (1)
275-292: ⚠️ Potential issue | 🟠 Major
Don't strip real `__*` field names during pushdown.
Line 285 still treats any `foo__str`/`foo__int`/`foo__float` reference as a conflict variant. That breaks legitimate JSON keys with those endings: `SELECT error__str FROM logs` makes `scan_config()` request `error`, so field pushdown drops the actual column. Strip only names that are known conflict groups, for example from `logfwd.conflict_groups`, instead of matching on the suffix alone.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/logfwd-transform/src/lib.rs` around lines 275 - 292, strip_type_suffix currently strips any name ending in "__str"/"__int"/"__float" which removes legitimate JSON keys; change it to only strip when the base name is a known conflict group. Modify strip_type_suffix (or its caller) to consult the configured conflict groups (e.g. logfwd.conflict_groups) and only return the base when the base appears in that set of conflict group names (otherwise return the original name unchanged); reference the function name strip_type_suffix and the conflict groups configuration (logfwd.conflict_groups) so the check is implemented against the canonical list rather than just matching suffixes.
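One way to gate the strip on known conflict groups is sketched below. The function name `strip_known_conflict_suffix` and the `conflict_bases` set are assumptions for illustration; the set is presumed to be derived from the `logfwd.conflict_groups` metadata, and how it reaches the pushdown code is not shown.

```rust
use std::collections::HashSet;

const CONFLICT_SUFFIXES: [&str; 3] = ["__int", "__float", "__str"];

/// Hypothetical variant of `strip_type_suffix`: return the base name only
/// when it belongs to a known conflict group, otherwise return `name` as-is.
fn strip_known_conflict_suffix<'a>(name: &'a str, conflict_bases: &HashSet<&str>) -> &'a str {
    for suffix in CONFLICT_SUFFIXES {
        if let Some(base) = name.strip_suffix(suffix) {
            if conflict_bases.contains(base) {
                return base;
            }
        }
    }
    name
}
```

With `status` as the only conflict group, `status__int` maps to `status`, while a real JSON key such as `error__str` is passed through unchanged and stays available to field pushdown.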
♻️ Duplicate comments (1)
crates/logfwd-transform/src/lib.rs (1)
607-611: ⚠️ Potential issue | 🟠 Major
Keep the physical `logs` schema untouched.
Line 611 still feeds a normalized batch into the `MemTable`, so conflict batches expose both the typed variants and the synthetic bare alias through `SELECT *`. That breaks the round-trip / wildcard semantics this PR is trying to preserve. Apply bare-name aliasing in planning/projection instead of mutating the `RecordBatch` before table registration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@crates/logfwd-transform/src/lib.rs` around lines 607 - 611, The code currently mutates the physical RecordBatch by calling conflict_schema::normalize_conflict_columns(batch) and then registering that normalized batch in the MemTable, which exposes both typed variants and the synthetic bare alias to SELECT *; revert this by keeping the original batch intact for table registration (register the unmodified batch with MemTable) and move the bare-name aliasing logic out of conflict_schema::normalize_conflict_columns into the query planning/projection stage so alias columns are synthesized at plan-time (not in the RecordBatch), e.g., stop passing the normalized batch into MemTable and instead apply the aliasing during projection/plan construction where the planner can map bare names to typed variants without mutating the stored batch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/logfwd-transform/src/lib.rs`:
- Around line 275-292: strip_type_suffix currently strips any name ending in
"__str"/"__int"/"__float" which removes legitimate JSON keys; change it to only
strip when the base name is a known conflict group. Modify strip_type_suffix (or
its caller) to consult the configured conflict groups (e.g.
logfwd.conflict_groups) and only return the base when the base appears in that
set of conflict group names (otherwise return the original name unchanged);
reference the function name strip_type_suffix and the conflict groups
configuration (logfwd.conflict_groups) so the check is implemented against the
canonical list rather than just matching suffixes.
---
Duplicate comments:
In `@crates/logfwd-transform/src/lib.rs`:
- Around line 607-611: The code currently mutates the physical RecordBatch by
calling conflict_schema::normalize_conflict_columns(batch) and then registering
that normalized batch in the MemTable, which exposes both typed variants and the
synthetic bare alias to SELECT *; revert this by keeping the original batch
intact for table registration (register the unmodified batch with MemTable) and
move the bare-name aliasing logic out of
conflict_schema::normalize_conflict_columns into the query planning/projection
stage so alias columns are synthesized at plan-time (not in the RecordBatch),
e.g., stop passing the normalized batch into MemTable and instead apply the
aliasing during projection/plan construction where the planner can map bare
names to typed variants without mutating the stored batch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 87960c00-6704-4c81-9120-fe564d66251a
📒 Files selected for processing (2)
crates/logfwd-transform/src/lib.rs
crates/logfwd-transform/tests/scanner_datafusion_boundary.rs
…#445)
Single-type fields now use bare column names (`status`, `level`) with their native Arrow type. Suffixed names (`status_int`, `status_str`) only appear when the same field has multiple types within a single batch.
Changes:
- StreamingBuilder + StorageBuilder: conflict detection in finish_batch()
- Delete rewriter.rs (775 lines of dead SQL text rewriter, never wired in)
- json_extract UDF: suffix_order tries bare name as fallback; non-numeric columns return null instead of coercing strings to numbers
- All tests updated to expect bare names for single-type fields
- CI: set RUSTC_WRAPPER="" so cargo test works without sccache
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
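The naming rule from this commit can be sketched in a few lines. This is a minimal illustration and not the actual `finish_batch()` code: `output_column_names` is a hypothetical helper, and the separator is passed in as `sep` because the suffix style changed from `_` to `__` later in the PR.

```rust
/// Hypothetical helper: decide the output column names for one field,
/// given the types observed for it within a single batch.
fn output_column_names(field: &str, types_seen: &[&str], sep: &str) -> Vec<String> {
    match types_seen {
        // Field never appeared in this batch: no column.
        [] => Vec::new(),
        // Exactly one type: bare name, native Arrow type.
        [_single] => vec![field.to_string()],
        // Two or more types: one suffixed column per type.
        many => many.iter().map(|t| format!("{field}{sep}{t}")).collect(),
    }
}
```

For example, `output_column_names("status", &["int", "str"], "_")` yields `status_int` and `status_str`, while a batch that only saw integers yields the single bare column `status`.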
- json_extract: add regression tests for json_int/json_float returning null when the field is a quoted string (bare Utf8 column, no conflict)
- scanner_conformance oracle: panic on type mismatch instead of silently skipping assertions; fall back bare→suffixed column lookup so single-type fields are actually verified
- scanner_conformance oracle: accept Int64 column when checking a float value (e.g. -0 has no decimal/exponent so scanner emits Int64)
- assert_builders_consistent: assert st_col is also string-typed before casting to Utf8 for value comparison
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…atches (#625)
When the scanner detects a type conflict for a field (e.g. `status` appears as both int and string across rows), it emits `status_int: Int64` and `status_str: Utf8View`. SQL using the bare name `status` would fail to resolve.
Add `normalize_conflict_columns()` in the new `conflict_schema` module:
- Detects conflict groups: ≥2 suffixed variants of the same base name with no existing bare column
- Adds a computed `status: Utf8` column via COALESCE(int→str, float→str, str)
- Single lone `foo_str` columns (field literally named `foo_str`) are NOT treated as conflicts (require ≥2 variants)
Wire into `SqlTransform::execute()` before MemTable registration so every batch is normalized before DataFusion sees it.
After this change `SELECT status FROM logs` works on both clean batches (bare `status: Int64` from scanner) and conflict batches (synthesized `status: Utf8`). Users call `int(status)` / `float(status)` for numeric ops.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
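The COALESCE(int→str, float→str, str) step can be pictured per row. The sketch below is an assumption-laden illustration using plain `Option` values instead of Arrow arrays; `merge_row_to_utf8` is a hypothetical name and is not the `merge_to_utf8` function from the PR, and the real code's number formatting may differ from `to_string()`.

```rust
/// Hypothetical per-row merge: take the first non-null variant in the order
/// int, float, str, and render it as a string.
fn merge_row_to_utf8(
    int_v: Option<i64>,
    float_v: Option<f64>,
    str_v: Option<&str>,
) -> Option<String> {
    int_v
        .map(|i| i.to_string())
        .or_else(|| float_v.map(|f| f.to_string()))
        .or_else(|| str_v.map(str::to_owned))
}
```

In a conflict batch each row normally has exactly one non-null variant, so a row where `status` was the integer 200 becomes `"200"`, a row where it was the string `OK` stays `"OK"`, and a row where the field was absent stays null.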
…, ConflictGroups
- type-suffix-redesign.md: document __ suffix convention, logfwd.conflict_groups metadata key, ConflictGroups/TypedValue output abstraction, implementation phases
- column-type-constraints.md: answer all 6 open questions
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ct_groups metadata
Rename conflict column suffixes from single-underscore (_int/_str/_float) to double-underscore (__int/__str/__float) so real field names like `status_int` cannot collide with synthesized conflict columns.
Stamp `logfwd.conflict_groups` Arrow schema metadata key when builders detect type conflicts (format: "status:int,str;duration:float,int"). Zero overhead when no conflicts in the batch.
Updates:
- storage_builder + streaming_builder: emit __int/__float/__str on conflict, accumulate conflict_meta, attach CONFLICT_GROUPS_METADATA_KEY to schema
- lib.rs: re-export CONFLICT_GROUPS_METADATA_KEY
- conflict_schema.rs: CONFLICT_SUFFIXES → __ prefix, HashMap → BTreeMap for deterministic ordering, preserve schema metadata in normalize_conflict_columns
- json_extract.rs: suffix_order updated for __ prefixes + bare-name fallback
- All tests updated: compliance_data, scanner_conformance, scanner_datafusion_boundary (Section 5 conflict tests: 4 new tests for bare-name SQL on conflict batches)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
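The metadata format quoted in this commit message ("status:int,str;duration:float,int") is simple enough to parse with the standard library. The sketch below is a hypothetical reader for that string; `parse_conflict_groups` is not a function from the PR, and it silently skips malformed groups, which the real code may or may not do.

```rust
use std::collections::BTreeMap;

/// Hypothetical parser for the `logfwd.conflict_groups` metadata value.
/// Input looks like "status:int,str;duration:float,int".
/// A BTreeMap keeps iteration order deterministic.
fn parse_conflict_groups(meta: &str) -> BTreeMap<String, Vec<String>> {
    meta.split(';')
        .filter(|group| !group.is_empty())
        .filter_map(|group| {
            // Groups without a ':' separator are skipped.
            let (base, variants) = group.split_once(':')?;
            let variants: Vec<String> = variants.split(',').map(str::to_owned).collect();
            Some((base.to_owned(), variants))
        })
        .collect()
}
```

A consumer that reads this map no longer needs to guess from column names, which is what allows a user field literally named `foo__int` to coexist with synthesized conflict columns.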
… test fixtures
The strip_type_suffix helper was stripping single-underscore suffixes (_str,
_int, _float), which caused two bugs after Phase 10b:
1. Real JSON field names like `start_int` would be incorrectly mapped to
`start` for scanner pushdown — the exact ambiguity Phase 10b was meant
to eliminate.
2. `strip_type_suffix("status__int")` returned `"status_"` (strips trailing
`_int`, leaves a dangling underscore), giving the scanner a garbage key.
Fix: strip __str/__int/__float (double underscore) only. For bare names (the
normal post-Phase-10 case) the function is now a no-op, which is correct.
Also:
- Update stale comments in lib.rs (line 106 and 600) and the Section 5 block
in scanner_datafusion_boundary.rs to reference __ suffixes consistently.
- Update make_test_batch() and all 14 dependent unit tests to use bare column
names (level, msg, status, host, region, val, n) instead of the old
level_str/msg_str/… convention, matching what the Phase-10 scanner emits.
- test_filter_hints_typed_column_stripped now tests severity__int rather than
severity_int, exercising the actual double-underscore strip path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
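Both bugs described in this commit message can be reproduced with a small stand-in. The two functions below are hypothetical reconstructions of the old and new stripping behavior, written only from the description above; the real `strip_type_suffix` may differ in signature and suffix order.

```rust
/// Stand-in for the old behavior: strips single-underscore suffixes.
fn strip_single_underscore(name: &str) -> &str {
    for suffix in ["_str", "_int", "_float"] {
        if let Some(base) = name.strip_suffix(suffix) {
            return base;
        }
    }
    name
}

/// Stand-in for the fixed behavior: strips double-underscore suffixes only.
fn strip_double_underscore(name: &str) -> &str {
    for suffix in ["__str", "__int", "__float"] {
        if let Some(base) = name.strip_suffix(suffix) {
            return base;
        }
    }
    name
}
```

The old version maps the real field `start_int` to `start` and turns `status__int` into `status_` with a dangling underscore. The fixed version leaves `start_int` alone, maps `status__int` to `status`, and is a no-op for bare names.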
- scanner: assert legacy _str/_int suffixed columns absent in single-type test
- streaming_builder + storage_builder: detect duplicate output column names (reserve_name guard)
- scanner_conformance: panic on missing scanner columns and wrong types instead of silently skipping
- integration: remove stale {field}_{type} and team_str comment references
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d3c7854 to deddbe3
Commit 085c2cf changed BatchMetadata.resource_attrs to Arc<Vec<...>> but didn't update the test constructors, breaking compilation on Linux/macOS CI. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai please do a very thorough review
Same 085c2cf regression — missed two more callsites in elasticsearch_arrow_ipc.rs and elasticsearch_arrow.rs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
🧩 Analysis chain
🏁 Script executed:
Length of output: 235551
Summary
The core design (bare names for single-type,
🔴 Critical
1.
conflict_schema: use logfwd.conflict_groups metadata as authoritative discriminator instead of suffix heuristics; prevents false-positive synthesis on user fields literally named foo__int / foo__str
json_extract: Str mode coalesces all conflict variants row-by-row so json(_raw,'status') returns "200" not null when status is int in that row
scanner_conformance: tighten Int64 fallback tolerance from < 1.0 to exact equality; fix it uses col.unwrap_or_else (already done); fix loose test
scanner_datafusion_boundary: stamp logfwd.conflict_groups metadata on test conflict batches to match real builder output
docs: mark AnalyzerRule/TableProvider section as planned (#625), not current; fix C3 claim in column-type-constraints.md; refresh phase breakdown in type-suffix-redesign.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Explains that C3 is satisfied via the planned TableProvider/AnalyzerRule approach: referenced columns are advertised as Utf8 with cast rewrites to typed backing columns, only for columns the query actually uses. normalize_conflict_columns is a batch-level approximation; #625 replaces it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- docs/COLUMN_NAMING.md: update to double-underscore suffixes (__int, __str, __float), document logfwd.conflict_groups metadata key, bare synthesized column for SQL, int()/float() UDF idiom, and cross-batch type instability warning
- scanner_datafusion_boundary.rs: add cross_batch_int_udf_works test documenting that int(status) works on both clean and conflict batches
- conflict_schema.rs: explain why Utf8 cast intentionally loses zero-copy for StreamingBuilder str columns (SQL transform path only)
- storage_builder.rs: comment on _raw pre-reservation condition
- streaming_builder.rs: use crate::CONFLICT_GROUPS_METADATA_KEY instead of crate::storage_builder::CONFLICT_GROUPS_METADATA_KEY (already re-exported at crate root)
- scanner.rs: rename test field s → status in test_type_conflict
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Since the type-suffix redesign (#684), columns are only suffixed on type conflict. The bench SQL referenced timestamp_str, level_str, etc. which no longer exist, causing every transform to fail silently and produce 0 output lines with 7.2 GB RSS from unbounded buffering. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Column naming convention section: update to describe actual behavior. Single-type fields use the base name (no suffix). Conflicting fields become a Struct column under the base name. Legacy _str/_int suffixes are not emitted. The old table described the pre-#684 behavior.
- kubernetes.md: /metrics returns 410 Gone (not "does not expose"). Special columns table (_raw_str, _file_str etc.) left unchanged pending verification of actual emitted column names in code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Implements the "suffix only on conflict" column naming design from #445.
- Single-type fields use bare names: `status` (Int64), `level` (Utf8)
- Conflicting fields are suffixed: `status_int` (Int64) + `status_str` (Utf8View)
- Deleted `rewriter.rs`: 775 lines of SQL text rewriter that was never wired into the pipeline ([arrow-best-practices] SQL rewrite layer is implemented but never applied before DataFusion execution #602)

The output layer already dispatched on Arrow `DataType` rather than column name suffix (commit 0321b18), so serialization is unaffected.

Changes
- `streaming_builder.rs`, `storage_builder.rs`: `finish_batch()` uses the bare name when single-type, suffixed on conflict
- `rewriter.rs`
- `json_extract.rs`: `suffix_order` tries bare name as fallback; non-numeric columns return null instead of coercing `"200"` → `200`
- `ci.yml`: `RUSTC_WRAPPER: ""` so `cargo test` works in CI without sccache
- `type-suffix-redesign.md`

Test plan
- … `_int`/`_str`/`_float` suffixes
- `json_extract` UDF: `json_int` on a quoted string returns null (not the coerced int value)

Closes
Partially closes #445 (steps 1-3 complete; schema padding and AnalyzerRule tracked in #625).
🤖 Generated with Claude Code