refactor: route csv enrichment through columnar builder#2268
Conversation
WalkthroughCSV enrichment ingestion was refactored to a columnar pipeline: CSVs are parsed into nullable Arrow Possibly related PRs
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 2 warnings)
✅ Passed checks (2 passed)
Comment |
ApprovabilityVerdict: Approved This refactor routes CSV enrichment through a shared columnar builder while changing the internal string representation from Utf8 to Utf8View. The change includes explicit parity tests comparing new vs legacy output row-by-row, and documentation updates reflecting the change. The semantic behavior is preserved with good test coverage. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
Pull request overview
This PR refactors the CSV enrichment table loader in logfwd-transform to produce Arrow batches via the shared ColumnarBatchBuilder, aligning CSV production with the columnar builder path (non-OTLP) while keeping CSV parsing semantics outside logfwd-arrow.
Changes:
- Rework
read_csv_to_batchto useBatchPlan+ColumnarBatchBuilderand emitUtf8Viewcolumns. - Add
restore_csv_header_columnsto preserve declared header columns even when a column is entirely missing in the data (all-null). - Expand CSV-focused tests to cover missing trailing cells, empty cells, quoted commas, header preservation, and legacy value/null parity.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@crates/logfwd-transform/src/enrichment.rs`:
- Around line 485-510: The CSV header validation logic (building
csv::ReaderBuilder, collecting headers into Vec<String>, checking empty,
empty-name and duplicate names) is duplicated between the production and legacy
test loader; extract this into a shared helper (e.g.,
parse_and_validate_csv_headers or validate_csv_headers) in enrichment.rs that
returns Result<Vec<String>, TransformError> (or Result<(), TransformError> plus
the headers) and replace the inline logic around csv_reader/headers/seens with
calls to that helper from both code paths (keep error messages as
TransformError::Enrichment with the same strings).
- Around line 460-466: The loop over headers currently calls
batch.schema().fields().iter().position(|field| field.name() == header) for each
header, causing O(headers * fields) work; precompute a name→index map once
(e.g., HashMap<String, usize> or HashMap<&str, usize>) from
batch.schema().fields().iter().enumerate() before the for header in headers
loop, then replace the position(...) call with a constant-time lookup
(map.get(header)) and use the returned index for restoration; update any code
that expects Option<usize> accordingly (e.g., handle missing names the same way
as before).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 056f2ffa-3f7c-4b2d-bce0-e5701ff8b935
📒 Files selected for processing (1)
crates/logfwd-transform/src/enrichment.rs
Auto-dismissed because every review thread opened from this change request is now resolved. If additional changes are still required, please leave a new review.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/logfwd-transform/src/enrichment.rs (1)
1773-1931: 🧹 Nitpick | 🔵 TrivialAdd a dedicated test for over-wide CSV rows.
The suite covers null/empty/quoted cases, but not the new
record.len() > num_colsrejection path. A singlea,b\n1,2,3\ncase would lock that behavior down.Proposed test
+ #[test] + fn csv_row_with_more_fields_than_headers_returns_error() { + let csv_data = b"a,b\n1,2,3\n"; + let table = CsvFileTable::new("t", "/fake"); + let err = table + .load_from_reader(&csv_data[..]) + .expect_err("row should be rejected"); + assert!(err.to_string().contains("has 3 fields, expected 2")); + }As per coding guidelines, "Tests must be named to describe the scenario (e.g.,
empty_input_returns_none); one test per behavior."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-transform/src/enrichment.rs` around lines 1773 - 1931, Add a new unit test named csv_overwide_row_rejected that exercises CsvFileTable::load_from_reader with an over-wide row (e.g., b"a,b\n1,2,3\n") and asserts the call returns Err; capture the error from table.load_from_reader(...) and assert err.to_string() contains a descriptive message like "too many fields" or "more fields than header" to lock down the rejection path implemented in CsvFileTable::load_from_reader / CsvFileTable::new; keep the test focused on this single behavior and follow the existing naming/style of other CSV tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@crates/logfwd-transform/src/enrichment.rs`:
- Around line 1773-1931: Add a new unit test named csv_overwide_row_rejected
that exercises CsvFileTable::load_from_reader with an over-wide row (e.g.,
b"a,b\n1,2,3\n") and asserts the call returns Err; capture the error from
table.load_from_reader(...) and assert err.to_string() contains a descriptive
message like "too many fields" or "more fields than header" to lock down the
rejection path implemented in CsvFileTable::load_from_reader /
CsvFileTable::new; keep the test focused on this single behavior and follow the
existing naming/style of other CSV tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Organization UI (inherited)
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: c99b1a91-b9c9-4177-aa89-d4437fd91771
📒 Files selected for processing (1)
crates/logfwd-transform/src/enrichment.rs
Summary
ColumnarBatchBuilderinstead of hand-buildingStringArraycolumns.logfwd-transform, outsidelogfwd-arrow.Utf8Viewrepresentation in the user config reference, architecture/design notes, verification inventory, and API comments.This addresses #1846 by proving a non-OTLP producer path through the shared builder without adding CSV parsing responsibilities to
logfwd-arrow.Verification
cargo fmt --check -p logfwd-transformcargo test -p logfwd-transform csv -- --nocapturecargo test -p logfwd-transform enrichment -- --nocapturecargo clippy -p logfwd-transform --lib -- -D warningscargo test -p logfwd-arrow columnar -- --nocapturecargo clippy -p logfwd-arrow --lib -- -D warningscargo fmt --check -p logfwdcargo test -p logfwd integration::test_enrichment_join -- --nocapturegit diff --checkNote
Route CSV enrichment through columnar builder to produce nullable Utf8View columns
read_csv_to_batchin enrichment.rs to useColumnarBatchBuilderwith aBatchPlanofUtf8Viewfields instead of the previous ad-hoc path.Utf8Viewarrays.read_csv_headersandvalidate_csv_headershelpers that reject empty header lists, blank header names, and duplicates.teamcolumn asStringViewArrayinstead ofStringArray.Utf8View(ArrowStringViewArray) rather thanUtf8(StringArray); downstream code that downcasts toStringArraywill need updating.Macroscope summarized 6ec9437.