refactor: route csv enrichment through columnar builder by strawgate · Pull Request #2268 · strawgate/fastforward

strawgate · 2026-04-19T01:31:22Z

Summary

Route the existing CSV enrichment producer through ColumnarBatchBuilder instead of hand-building StringArray columns.
Keep CSV delimiter, quote, header, and row-alignment semantics in logfwd-transform, outside logfwd-arrow.
Preserve CSV headers even when a column is entirely missing in the data by restoring all-null Utf8View columns after builder finalization.
Document the CSV Utf8View representation in the user config reference, architecture/design notes, verification inventory, and API comments.
Add CSV coverage for missing trailing cells, empty cells, quoted commas, all-null header preservation, and legacy value/null parity.

This addresses #1846 by proving a non-OTLP producer path through the shared builder without adding CSV parsing responsibilities to logfwd-arrow.

Verification

cargo fmt --check -p logfwd-transform
cargo test -p logfwd-transform csv -- --nocapture
cargo test -p logfwd-transform enrichment -- --nocapture
cargo clippy -p logfwd-transform --lib -- -D warnings
cargo test -p logfwd-arrow columnar -- --nocapture
cargo clippy -p logfwd-arrow --lib -- -D warnings
cargo fmt --check -p logfwd
cargo test -p logfwd integration::test_enrichment_join -- --nocapture
git diff --check

Note

Route CSV enrichment through columnar builder to produce nullable Utf8View columns

Rewrites read_csv_to_batch in enrichment.rs to use ColumnarBatchBuilder with a BatchPlan of Utf8View fields instead of the previous ad-hoc path.
Empty CSV cells are preserved as empty strings; missing trailing cells in a row are written as NULL; columns entirely absent from all rows are materialized as null Utf8View arrays.
Extracts read_csv_headers and validate_csv_headers helpers that reject empty header lists, blank header names, and duplicates.
Updates the integration test in integration.rs to downcast the joined team column as StringViewArray instead of StringArray.
Behavioral Change: CSV enrichment columns are now Utf8View (Arrow StringViewArray) rather than Utf8 (StringArray); downstream code that downcasts to StringArray will need updating.
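
The header checks described above (empty header list, blank names, duplicates) can be sketched with the standard library alone. This is a minimal illustration, not the code from enrichment.rs: `HeaderError` is a hypothetical stand-in for `TransformError::Enrichment`, the error strings are invented, and treating whitespace-only names as blank is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical error type standing in for the crate's `TransformError::Enrichment`.
#[derive(Debug, PartialEq)]
pub struct HeaderError(pub String);

/// Reject empty header lists, blank header names, and duplicate names.
/// Assumption: a whitespace-only name counts as blank.
pub fn validate_csv_headers(headers: &[String]) -> Result<(), HeaderError> {
    if headers.is_empty() {
        return Err(HeaderError("CSV has no header columns".to_string()));
    }
    let mut seen: HashSet<&str> = HashSet::with_capacity(headers.len());
    for (idx, name) in headers.iter().enumerate() {
        if name.trim().is_empty() {
            return Err(HeaderError(format!("CSV header {} is blank", idx)));
        }
        // `insert` returns false when the name was already present.
        if !seen.insert(name.as_str()) {
            return Err(HeaderError(format!("duplicate CSV header '{}'", name)));
        }
    }
    Ok(())
}
```

Returning `Result<(), _>` keeps the helper usable from both the production loader and the legacy test-only reader, each of which already holds the parsed header list.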

Macroscope summarized 6ec9437.

coderabbitai · 2026-04-19T01:31:28Z

Walkthrough

CSV enrichment ingestion was refactored to a columnar pipeline: CSVs are parsed into nullable Arrow Utf8View (StringViewArray) columns via BatchPlan and ColumnarBatchBuilder (writing with write_str and finalizing with finish_batch). The loader now leaves missing trailing cells as NULLs, maps empty cells to empty strings, and restores header order by inserting any omitted header columns as nullable all-NULL Utf8View arrays. Header parsing/validation was centralized, a legacy test-only reader preserving prior StringArray/padding semantics was added, and tests were expanded to assert Utf8View typing and parity with legacy null/value behavior.
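
The cell semantics in the walkthrough (empty cell stays an empty string, missing trailing cell becomes NULL) can be illustrated without Arrow. The sketch below is hypothetical: `align_row` is not a function from the PR, and `Option<String>` stands in for a nullable Utf8View slot. It assumes rows wider than the header have already been rejected.

```rust
/// Align one parsed CSV record to `num_cols` header columns.
///
/// - A present cell, including an empty one, becomes `Some(value)`.
/// - A missing trailing cell becomes `None` (NULL in the Arrow batch).
///
/// Assumption: the caller has already rejected records longer than `num_cols`.
pub fn align_row(record: &[&str], num_cols: usize) -> Vec<Option<String>> {
    (0..num_cols)
        .map(|col| record.get(col).map(|cell| (*cell).to_string()))
        .collect()
}
```

For a three-column header, the record `1,` (two cells, the second empty) yields `Some("1")`, `Some("")`, `None`, which is the distinction the new tests assert.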

Possibly related PRs

feat: enrichment tables, processors, and CSV geo backend #2089: Modifies CSV/enrichment codepath and schema behavior in the same enrichment.rs area, indicating a strong code-level connection to the CSV ingestion changes.

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.


❌ Failed checks (1 error, 2 warnings)

Check name	Status	Explanation	Resolution
High-Quality Rust Practices	❌ Error	The public method `CsvFileTable::path()` (lines 364-366) lacks the documentation comment required by the High-Quality Rust Practices guide.	Add a doc comment to the `path()` method explaining its purpose and return value.
Documentation Thoroughly Updated	⚠️ Warning	PR updated ARCHITECTURE.md, DESIGN.md, VERIFICATION.md, and CONFIG_REFERENCE.md appropriately for CSV→Utf8View refactor, but PHASES.md was not updated to mark task complete with PR `#2268`.	Add entry to dev-docs/PHASES.md marking CSV enrichment columnar routing task complete with PR `#2268` to synchronize task tracking documentation.
Maintainer Fitness	⚠️ Warning	PR description lacks explicit documentation of the behavioral change (Utf8 to Utf8View), consumer impact, and test coverage details required by the maintainer fitness check.	Update commit message to state: behavioral change (CSV now emits Utf8View), consumer impact (consumers must downcast to StringViewArray), test coverage (unit+integration), and hot-path status. Link issue `#1846`/`#2268`.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Formal Verification Coverage	✅ Passed	PR modifies logfwd-transform with only private functions; verification requirement applies only to logfwd-core. VERIFICATION.md correctly documents enrichment.rs module with justified Kani exemption.
Crate Boundary And Dependency Integrity	✅ Passed	logfwd-core maintains forbid(unsafe_code) and only memchr/wide dependencies; dependency hierarchy flows strictly upward (core→arrow→transform); csv in workspace.dependencies only in logfwd-transform; binary crate is CLI/orchestration-only; no new crates added.


macroscopeapp · 2026-04-19T01:40:10Z

Approvability

Verdict: Approved

This refactor routes CSV enrichment through a shared columnar builder while changing the internal string representation from Utf8 to Utf8View. The change includes explicit parity tests comparing new vs legacy output row-by-row, and documentation updates reflecting the change. The semantic behavior is preserved with good test coverage.


Copilot

Pull request overview

This PR refactors the CSV enrichment table loader in logfwd-transform to produce Arrow batches via the shared ColumnarBatchBuilder, aligning CSV production with the columnar builder path (non-OTLP) while keeping CSV parsing semantics outside logfwd-arrow.

Changes:

Rework read_csv_to_batch to use BatchPlan + ColumnarBatchBuilder and emit Utf8View columns.
Add restore_csv_header_columns to preserve declared header columns even when a column is entirely missing in the data (all-null).
Expand CSV-focused tests to cover missing trailing cells, empty cells, quoted commas, header preservation, and legacy value/null parity.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-transform/src/enrichment.rs`:
- Around line 485-510: The CSV header validation logic (building
csv::ReaderBuilder, collecting headers into Vec<String>, checking empty,
empty-name and duplicate names) is duplicated between the production and legacy
test loader; extract this into a shared helper (e.g.,
parse_and_validate_csv_headers or validate_csv_headers) in enrichment.rs that
returns Result<Vec<String>, TransformError> (or Result<(), TransformError> plus
the headers) and replace the inline logic around csv_reader/headers/seens with
calls to that helper from both code paths (keep error messages as
TransformError::Enrichment with the same strings).
- Around line 460-466: The loop over headers currently calls
batch.schema().fields().iter().position(|field| field.name() == header) for each
header, causing O(headers * fields) work; precompute a name→index map once
(e.g., HashMap<String, usize> or HashMap<&str, usize>) from
batch.schema().fields().iter().enumerate() before the for header in headers
loop, then replace the position(...) call with a constant-time lookup
(map.get(header)) and use the returned index for restoration; update any code
that expects Option<usize> accordingly (e.g., handle missing names the same way
as before).
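
The second finding (replace a per-header `position(...)` scan with a precomputed name-to-index map) can be sketched as follows. This is an illustration under assumptions, not the PR's `restore_csv_header_columns`: `Column` is a plain `Vec<Option<String>>` standing in for an Arrow array, and `restore_header_columns` and its signature are hypothetical.

```rust
use std::collections::HashMap;

/// A column as a list of nullable cells; a stand-in for an Arrow array.
pub type Column = Vec<Option<String>>;

/// Rebuild columns in header order, inserting an all-null column for any
/// header that the builder did not materialize.
pub fn restore_header_columns(
    headers: &[String],
    built: Vec<(String, Column)>,
    num_rows: usize,
) -> Vec<(String, Column)> {
    // Build name -> index once so each header lookup is O(1) on average,
    // instead of scanning all fields for every header.
    let index_by_name: HashMap<&str, usize> = built
        .iter()
        .enumerate()
        .map(|(idx, (name, _))| (name.as_str(), idx))
        .collect();

    headers
        .iter()
        .map(|header| {
            let column = match index_by_name.get(header.as_str()) {
                Some(&idx) => built[idx].1.clone(),
                // Header never appeared in the built batch: all-null column.
                None => vec![None; num_rows],
            };
            (header.clone(), column)
        })
        .collect()
}
```

Borrowing `&str` keys from the built columns avoids allocating a `String` per field; with real Arrow arrays the clone would be a cheap reference-count bump rather than a deep copy.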


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 056f2ffa-3f7c-4b2d-bce0-e5701ff8b935

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0a3c1 and 4250c91.

📒 Files selected for processing (1)

crates/logfwd-transform/src/enrichment.rs

Auto-dismissed because every review thread opened from this change request is now resolved. If additional changes are still required, please leave a new review.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/logfwd-transform/src/enrichment.rs (1)

1773-1931: 🧹 Nitpick | 🔵 Trivial

Add a dedicated test for over-wide CSV rows.

The suite covers null/empty/quoted cases, but not the new record.len() > num_cols rejection path. A single a,b\n1,2,3\n case would lock that behavior down.

Proposed test

+    #[test]
+    fn csv_row_with_more_fields_than_headers_returns_error() {
+        let csv_data = b"a,b\n1,2,3\n";
+        let table = CsvFileTable::new("t", "/fake");
+        let err = table
+            .load_from_reader(&csv_data[..])
+            .expect_err("row should be rejected");
+        assert!(err.to_string().contains("has 3 fields, expected 2"));
+    }

As per coding guidelines, "Tests must be named to describe the scenario (e.g., empty_input_returns_none); one test per behavior."
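
For context, the rejection path that this test would pin down amounts to a width check before any cell is written. The sketch below is hypothetical: `check_row_width` is not a function from the PR, and only the "has 3 fields, expected 2" fragment of the message comes from the proposed test; the rest of the wording is assumed.

```rust
/// Return an error when a record has more fields than the header declares.
/// Shorter records are allowed; their missing trailing cells become NULL.
pub fn check_row_width(
    row_number: usize,
    record_len: usize,
    num_cols: usize,
) -> Result<(), String> {
    if record_len > num_cols {
        return Err(format!(
            "CSV row {} has {} fields, expected {}",
            row_number, record_len, num_cols
        ));
    }
    Ok(())
}
```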

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-transform/src/enrichment.rs` around lines 1773 - 1931, Add a
new unit test named csv_overwide_row_rejected that exercises
CsvFileTable::load_from_reader with an over-wide row (e.g., b"a,b\n1,2,3\n") and
asserts the call returns Err; capture the error from table.load_from_reader(...)
and assert err.to_string() contains a descriptive message like "too many fields"
or "more fields than header" to lock down the rejection path implemented in
CsvFileTable::load_from_reader / CsvFileTable::new; keep the test focused on
this single behavior and follow the existing naming/style of other CSV tests.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: c99b1a91-b9c9-4177-aa89-d4437fd91771

📥 Commits

Reviewing files that changed from the base of the PR and between a778140 and 64bd052.

📒 Files selected for processing (1)

crates/logfwd-transform/src/enrichment.rs

refactor: route csv enrichment through columnar builder

4250c91

strawgate marked this pull request as ready for review April 19, 2026 01:39

Copilot AI review requested due to automatic review settings April 19, 2026 01:39

Copilot started reviewing on behalf of strawgate April 19, 2026 01:39 View session

Copilot AI reviewed Apr 19, 2026


Comment thread crates/logfwd-transform/src/enrichment.rs Outdated

Comment thread crates/logfwd-transform/src/enrichment.rs Outdated

Comment thread crates/logfwd-transform/src/enrichment.rs Outdated

coderabbitai Bot previously requested changes Apr 19, 2026


Comment thread crates/logfwd-transform/src/enrichment.rs Outdated

Comment thread crates/logfwd-transform/src/enrichment.rs Outdated

strawgate added 2 commits April 18, 2026 20:46

test: update enrichment join for utf8view csv columns

a778140

refactor: share csv header validation

64bd052

docs: document csv utf8view enrichment output

6ec9437

coderabbitai Bot reviewed Apr 19, 2026


coderabbitai Bot approved these changes Apr 19, 2026


macroscopeapp Bot approved these changes Apr 19, 2026


strawgate merged commit b6f65be into main Apr 19, 2026
20 of 24 checks passed

strawgate deleted the codex/csv-columnar-producer branch April 19, 2026 02:19

strawgate mentioned this pull request Apr 19, 2026

work-unit: csv input — prove non-OTLP ColumnarBatchBuilder producer #1846

Closed

5 tasks

strawgate merged 4 commits into main from codex/csv-columnar-producer

2 participants
