
refactor: route csv enrichment through columnar builder #2268

Merged
strawgate merged 4 commits into main from codex/csv-columnar-producer
Apr 19, 2026
Conversation

@strawgate
Owner

@strawgate strawgate commented Apr 19, 2026

Summary

  • Route the existing CSV enrichment producer through ColumnarBatchBuilder instead of hand-building StringArray columns.
  • Keep CSV delimiter, quote, header, and row-alignment semantics in logfwd-transform, outside logfwd-arrow.
  • Preserve CSV headers even when a column is entirely missing in the data by restoring all-null Utf8View columns after builder finalization.
  • Document the CSV Utf8View representation in the user config reference, architecture/design notes, verification inventory, and API comments.
  • Add CSV coverage for missing trailing cells, empty cells, quoted commas, all-null header preservation, and legacy value/null parity.

This addresses #1846 by proving a non-OTLP producer path through the shared builder without adding CSV parsing responsibilities to logfwd-arrow.
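
The row-alignment rule described above (empty cells stay empty strings, missing trailing cells become NULL) can be sketched without Arrow. This is a minimal illustration only: `Cell` and `align_row` are hypothetical names, `Option<String>` stands in for a nullable Utf8View slot, and none of it is the code from this PR.

```rust
/// One CSV cell after alignment: `None` models an Arrow NULL,
/// `Some(String::new())` models an empty string.
type Cell = Option<String>;

/// Align one parsed record to the header width.
/// Hypothetical helper: cells present in the record are kept as-is
/// (including empty ones), and positions past the end of the record
/// become `None`.
fn align_row(record: &[&str], num_cols: usize) -> Vec<Cell> {
    (0..num_cols)
        .map(|i| record.get(i).map(|s| s.to_string()))
        .collect()
}
```

With a three-column header, `align_row(&["web", "", "alice"], 3)` keeps the empty middle cell as `Some("")`, while `align_row(&["db", "infra"], 3)` yields `None` in the third position.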

Verification

  • cargo fmt --check -p logfwd-transform
  • cargo test -p logfwd-transform csv -- --nocapture
  • cargo test -p logfwd-transform enrichment -- --nocapture
  • cargo clippy -p logfwd-transform --lib -- -D warnings
  • cargo test -p logfwd-arrow columnar -- --nocapture
  • cargo clippy -p logfwd-arrow --lib -- -D warnings
  • cargo fmt --check -p logfwd
  • cargo test -p logfwd integration::test_enrichment_join -- --nocapture
  • git diff --check

Note

Route CSV enrichment through columnar builder to produce nullable Utf8View columns

  • Rewrites read_csv_to_batch in enrichment.rs to use ColumnarBatchBuilder with a BatchPlan of Utf8View fields instead of the previous ad-hoc path.
  • Empty CSV cells are preserved as empty strings; missing trailing cells in a row are written as NULL; columns entirely absent from all rows are materialized as null Utf8View arrays.
  • Extracts read_csv_headers and validate_csv_headers helpers that reject empty header lists, blank header names, and duplicates.
  • Updates the integration test in integration.rs to downcast the joined team column as StringViewArray instead of StringArray.
  • Behavioral Change: CSV enrichment columns are now Utf8View (Arrow StringViewArray) rather than Utf8 (StringArray); downstream code that downcasts to StringArray will need updating.
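
The header checks listed above (empty header list, blank names, duplicates) could look roughly like the following. This is a sketch under assumptions: `validate_headers` is a hypothetical stand-in for the PR's `validate_csv_headers`, the error type is a plain `String` rather than `TransformError::Enrichment`, the messages are invented, and treating whitespace-only names as blank is a guess.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the PR's header validation helper.
/// The real signature, error type, and messages are not shown in this PR page.
fn validate_headers(headers: &[String]) -> Result<(), String> {
    // Reject a CSV with no header columns at all.
    if headers.is_empty() {
        return Err("CSV has no header columns".to_string());
    }
    let mut seen = HashSet::with_capacity(headers.len());
    for (idx, name) in headers.iter().enumerate() {
        // Assumption: whitespace-only names count as blank.
        if name.trim().is_empty() {
            return Err(format!("CSV header {idx} is blank"));
        }
        // `insert` returns false when the name was already present.
        if !seen.insert(name.as_str()) {
            return Err(format!("duplicate CSV header '{name}'"));
        }
    }
    Ok(())
}
```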

Macroscope summarized 6ec9437.

@coderabbitai

coderabbitai Bot commented Apr 19, 2026

Walkthrough

CSV enrichment ingestion was refactored to a columnar pipeline: CSVs are parsed into nullable Arrow Utf8View (StringViewArray) columns via BatchPlan and ColumnarBatchBuilder (writing with write_str and finalizing with finish_batch). The loader now leaves missing trailing cells as NULLs, maps empty cells to empty strings, and restores header order by inserting any omitted header columns as nullable all-NULL Utf8View arrays. Header parsing/validation was centralized, a legacy test-only reader preserving prior StringArray/padding semantics was added, and tests were expanded to assert Utf8View typing and parity with legacy null/value behavior.

Possibly related PRs


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.


❌ Failed checks (1 error, 2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| High-Quality Rust Practices | ❌ Error | The public method CsvFileTable::path() (line 364-366) lacks a documentation comment required by High-Quality Rust Practices guide. | Add a doc comment to the path() method explaining its purpose and return value. |
| Documentation Thoroughly Updated | ⚠️ Warning | PR updated ARCHITECTURE.md, DESIGN.md, VERIFICATION.md, and CONFIG_REFERENCE.md appropriately for CSV→Utf8View refactor, but PHASES.md was not updated to mark task complete with PR #2268. | Add entry to dev-docs/PHASES.md marking CSV enrichment columnar routing task complete with PR #2268 to synchronize task tracking documentation. |
| Maintainer Fitness | ⚠️ Warning | PR description lacks explicit documentation of behavioral change (Utf8 to Utf8View), consumer impact, and test coverage details required by maintainer fitness check. | Update commit message to state: behavioral change (CSV now emits Utf8View), consumer impact (require StringViewArray downcast), test coverage (unit+integration), and hot-path status. Link issue #1846/#2268. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Formal Verification Coverage | ✅ Passed | PR modifies logfwd-transform with only private functions; verification requirement applies only to logfwd-core. VERIFICATION.md correctly documents enrichment.rs module with justified Kani exemption. |
| Crate Boundary And Dependency Integrity | ✅ Passed | logfwd-core maintains forbid(unsafe_code) and only memchr/wide dependencies; dependency hierarchy flows strictly upward (core→arrow→transform); csv in workspace.dependencies only in logfwd-transform; binary crate is CLI/orchestration-only; no new crates added. |


@strawgate strawgate marked this pull request as ready for review April 19, 2026 01:39
Copilot AI review requested due to automatic review settings April 19, 2026 01:39
@macroscopeapp

macroscopeapp Bot commented Apr 19, 2026

Approvability

Verdict: Approved

This refactor routes CSV enrichment through a shared columnar builder while changing the internal string representation from Utf8 to Utf8View. The change includes explicit parity tests comparing new vs legacy output row-by-row, and documentation updates reflecting the change. The semantic behavior is preserved with good test coverage.



Copilot AI left a comment


Pull request overview

This PR refactors the CSV enrichment table loader in logfwd-transform to produce Arrow batches via the shared ColumnarBatchBuilder, aligning CSV production with the columnar builder path (non-OTLP) while keeping CSV parsing semantics outside logfwd-arrow.

Changes:

  • Rework read_csv_to_batch to use BatchPlan + ColumnarBatchBuilder and emit Utf8View columns.
  • Add restore_csv_header_columns to preserve declared header columns even when a column is entirely missing in the data (all-null).
  • Expand CSV-focused tests to cover missing trailing cells, empty cells, quoted commas, header preservation, and legacy value/null parity.

Comment thread crates/logfwd-transform/src/enrichment.rs Outdated
Comment thread crates/logfwd-transform/src/enrichment.rs Outdated
Comment thread crates/logfwd-transform/src/enrichment.rs Outdated
coderabbitai[bot]
coderabbitai Bot previously requested changes Apr 19, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-transform/src/enrichment.rs`:
- Around line 485-510: The CSV header validation logic (building
csv::ReaderBuilder, collecting headers into Vec<String>, checking empty,
empty-name and duplicate names) is duplicated between the production and legacy
test loader; extract this into a shared helper (e.g.,
parse_and_validate_csv_headers or validate_csv_headers) in enrichment.rs that
returns Result<Vec<String>, TransformError> (or Result<(), TransformError> plus
the headers) and replace the inline logic around csv_reader/headers/seens with
calls to that helper from both code paths (keep error messages as
TransformError::Enrichment with the same strings).
- Around line 460-466: The loop over headers currently calls
batch.schema().fields().iter().position(|field| field.name() == header) for each
header, causing O(headers * fields) work; precompute a name→index map once
(e.g., HashMap<String, usize> or HashMap<&str, usize>) from
batch.schema().fields().iter().enumerate() before the for header in headers
loop, then replace the position(...) call with a constant-time lookup
(map.get(header)) and use the returned index for restoration; update any code
that expects Option<usize> accordingly (e.g., handle missing names the same way
as before).
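
The second finding (precompute a name→index map, then do constant-time lookups while restoring header order) can be illustrated with plain vectors. This is a sketch, not the PR's `restore_csv_header_columns`: `Column` and `restore_header_columns` are hypothetical names, and `Vec<Option<String>>` stands in for a nullable Utf8View array.

```rust
use std::collections::HashMap;

/// Stand-in for a nullable string column; `None` models NULL.
type Column = Vec<Option<String>>;

/// Hypothetical sketch: return columns in header order, filling any header
/// that is absent from `names` with an all-NULL column of `num_rows` entries.
/// `names[i]` is the name of `columns[i]`.
fn restore_header_columns(
    headers: &[&str],
    names: &[&str],
    columns: &[Column],
    num_rows: usize,
) -> Vec<Column> {
    // Build the name -> index map once, so each header lookup is O(1)
    // instead of a linear scan over all fields.
    let index_by_name: HashMap<&str, usize> = names
        .iter()
        .enumerate()
        .map(|(i, &name)| (name, i))
        .collect();

    headers
        .iter()
        .map(|header| match index_by_name.get(header) {
            Some(&i) => columns[i].clone(),
            None => vec![None; num_rows],
        })
        .collect()
}
```

The map costs one pass over the built fields; the restoration loop then does a single hash lookup per header rather than a `position(...)` scan.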

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0a3c1 and 4250c91.

📒 Files selected for processing (1)
  • crates/logfwd-transform/src/enrichment.rs

Comment thread crates/logfwd-transform/src/enrichment.rs Outdated
Comment thread crates/logfwd-transform/src/enrichment.rs Outdated
@github-actions github-actions Bot dismissed coderabbitai[bot]’s stale review April 19, 2026 01:57

Auto-dismissed because every review thread opened from this change request is now resolved. If additional changes are still required, please leave a new review.


@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/logfwd-transform/src/enrichment.rs (1)

1773-1931: 🧹 Nitpick | 🔵 Trivial

Add a dedicated test for over-wide CSV rows.

The suite covers null/empty/quoted cases, but not the new record.len() > num_cols rejection path. A single a,b\n1,2,3\n case would lock that behavior down.

Proposed test
+    #[test]
+    fn csv_row_with_more_fields_than_headers_returns_error() {
+        let csv_data = b"a,b\n1,2,3\n";
+        let table = CsvFileTable::new("t", "/fake");
+        let err = table
+            .load_from_reader(&csv_data[..])
+            .expect_err("row should be rejected");
+        assert!(err.to_string().contains("has 3 fields, expected 2"));
+    }

As per coding guidelines, "Tests must be named to describe the scenario (e.g., empty_input_returns_none); one test per behavior."
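
For context, the rejection path this test targets can be pictured as a width check like the one below. It is only a sketch: `check_row_width` is a hypothetical helper, the error is a plain `String`, and the message merely mirrors the "has 3 fields, expected 2" substring asserted in the proposed test; the real wording in `enrichment.rs` may differ.

```rust
/// Hypothetical width check for one CSV record.
/// Over-wide rows are rejected; short rows are accepted because their
/// missing trailing cells are later written as NULL.
fn check_row_width(
    row_number: usize,
    field_count: usize,
    num_cols: usize,
) -> Result<(), String> {
    if field_count > num_cols {
        // Assumed message shape, modeled on the proposed test's assertion.
        return Err(format!(
            "CSV row {row_number} has {field_count} fields, expected {num_cols}"
        ));
    }
    Ok(())
}
```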

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/src/enrichment.rs` around lines 1773 - 1931, Add a
new unit test named csv_overwide_row_rejected that exercises
CsvFileTable::load_from_reader with an over-wide row (e.g., b"a,b\n1,2,3\n") and
asserts the call returns Err; capture the error from table.load_from_reader(...)
and assert err.to_string() contains a descriptive message like "too many fields"
or "more fields than header" to lock down the rejection path implemented in
CsvFileTable::load_from_reader / CsvFileTable::new; keep the test focused on
this single behavior and follow the existing naming/style of other CSV tests.
📥 Commits

Reviewing files that changed from the base of the PR and between a778140 and 64bd052.

📒 Files selected for processing (1)
  • crates/logfwd-transform/src/enrichment.rs

@strawgate strawgate merged commit b6f65be into main Apr 19, 2026
20 of 24 checks passed
@strawgate strawgate deleted the codex/csv-columnar-producer branch April 19, 2026 02:19

2 participants