Refactor SIMD scanner to support multiple builder backends by strawgate · Pull Request #59 · strawgate/fastforward

strawgate · 2026-03-29T16:56:44Z

Summary

Refactored the SIMD scanner to decouple the scanning logic from the builder implementation by introducing a ScanBuilder trait. This enables code reuse across different builder backends while maintaining performance through inlining.

Key Changes

Introduced ScanBuilder trait: A shared interface for index-based field access that both IndexedBatchBuilder and ColumnarBatchBuilder implement, allowing the core scan loop to be builder-agnostic.
Extracted generic scan_into() function: Moved the core scanning logic into a generic function parameterized over ScanBuilder, eliminating code duplication between different scanner implementations.
Added ColumnarSimdScanner: New scanner using ColumnarBatchBuilder for zero-copy scanning with adaptive dictionary encoding and optional zstd-compressed Arrow IPC output via scan_compressed().
Refactored SimdScanner: Now uses IndexedBatchBuilder (replacing the old BatchBuilder) with bitset-tracked fields for ~3x faster builder performance via direct index-based appends instead of HashMap lookups.
Updated RecordBatchOptions: All builders now explicitly set row count when constructing RecordBatch objects for consistency.
Enhanced fuzzing: Extended the scanner fuzzer to exercise both SimdScanner and ColumnarSimdScanner across extract-all and field pushdown modes.
Fixed default config: Changed ScanConfig::keep_raw default from true to false to match typical use cases.

Implementation Details

All trait methods are marked #[inline(always)] to ensure zero-cost abstraction and allow the compiler to inline the scanning loop.
The set_buffer() method is optional (default no-op) for builders that don't need zero-copy semantics.
Both scanner implementations share identical scanning logic but differ in their builder backend and output format (uncompressed vs. compressed IPC).

https://claude.ai/code/session_01QamNN5H6VX3GE34Y28hqXJ

Replace the HashMap-based BatchBuilder in SimdScanner with the 3x faster IndexedBatchBuilder (bitset field tracking, direct index-based appends). Add ColumnarSimdScanner for zero-copy scanning with adaptive dictionary encoding and zstd-compressed Arrow IPC output. Key changes: - Define ScanBuilder trait shared by IndexedBatchBuilder and ColumnarBatchBuilder - Extract scan loop into generic scan_into<B>() to avoid code duplication - SimdScanner now uses IndexedBatchBuilder (drop-in, same public API) - New ColumnarSimdScanner with scan() and scan_compressed() methods - Default keep_raw to false (was 65% of uncompressed table size) - Fix all three builders to use RecordBatchOptions for correct row count on empty-column batches (e.g. scanning "{}" with keep_raw=false) - Add SIMD scanner + columnar scanner coverage to fuzz harness https://claude.ai/code/session_01QamNN5H6VX3GE34Y28hqXJ

…e failure https://claude.ai/code/session_01QamNN5H6VX3GE34Y28hqXJ

coderabbitai · 2026-03-29T16:56:58Z

Walkthrough

This PR refactors the JSON-to-Arrow scanning architecture by introducing a generic ScanBuilder trait that decouples scan logic from builder implementations. It adds ColumnarSimdScanner as a new public type supporting both uncompressed and zstd-compressed Arrow IPC output. All builder types now use RecordBatchOptions with explicit row counts instead of special-casing empty batches. Field-tracking support for indices ≥64 is extended in both ColumnarBuilder and IndexedBuilder with per-row "already written" flags. The dictionary encoding threshold for string columns is lowered from 256 to 128. String escape-sequence classification in ChunkIndex::new is refined to correctly handle backslash state transitions. The fuzz target adds a UTF-8 validation guard and expands test coverage to six scan executions per input across three implementations and two configuration modes.

Possibly related PRs

feat: SIMD JSON-to-Arrow scanner — direct columnar output, no DOM #6: Introduced the foundational SIMD JSON-to-Arrow scanner infrastructure and builder types being refactored in this PR
Add enrichment table framework: K8s path parser, host info, static labels #38: Modified the same fuzz target and scanner/batch construction code paths
Add fuzz testing harness (3 targets, found scanner crash) #57: Addressed fuzz-harness crashes by introducing UTF-8 validation, now formalized as the early guard in the fuzz target

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/logfwd-core/src/scanner.rs (1)
36-42: ⚠️ Potential issue | 🟠 Major

Don't change ScanConfig::default() silently.

keep_raw: false changes the public default schema for every caller still using ScanConfig::default(). We already have default-based callers in crates/logfwd-core/tests/scanner_correctness.rs:10-19 and crates/logfwd-bench/benches/pipeline.rs:107-111,209-213, so this PR changes their output shape and benchmark behavior with no compile-time signal. Please either keep the old default, or update every default-based call site and document the schema break explicitly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/src/scanner.rs` around lines 36 - 42, The change to
ScanConfig::default altered the public default schema by setting keep_raw:
false; revert keep_raw to its previous value (set keep_raw: true) in the
ScanConfig::default implementation to preserve the existing public behavior
(ScanConfig::default, wanted_fields, extract_all remain unchanged), or if the
schema change is intentional, update every default-based caller to construct
ScanConfig explicitly with the new keep_raw value and add a documented,
top-level breaking-change note about the schema change; ensure you modify
ScanConfig::default (the impl Default for ScanConfig) and either revert keep_raw
or update and document callers accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-core/fuzz/fuzz_targets/scanner.rs`:
- Around line 50-85: The fuzz harness must reject non-UTF-8 inputs before
invoking the scanners because the builders (batch_builder.rs,
indexed_builder.rs, columnar_builder.rs) call from_utf8_unchecked and cause UB;
add a UTF-8 gate at the top of the fuzz_target closure (before calling
Scanner::scan, SimdScanner::scan, ColumnarSimdScanner::scan) that checks the
input with std::str::from_utf8 (or equivalent) and returns/continues early if
the check fails so only valid UTF-8 bytes reach the builders.

In `@crates/logfwd-core/src/simd_scanner.rs`:
- Around line 279-296: SimdScanner currently uses IndexedBatchBuilder which
tracks per-row writes in a u64 and will break for >64 extracted fields; add a
guard in SimdScanner::new (or right before calling scan_into in
SimdScanner::scan) that checks the expected or detected max field count and if
it can exceed 64, use a fallback builder (e.g., the previous non-indexed batch
builder) or a safe implementation that supports >64 columns instead of
IndexedBatchBuilder; reference the SimdScanner::new, SimdScanner::scan,
IndexedBatchBuilder, and finish_batch() symbols so the change replaces the
builder selection logic and prevents finish_batch() panics.
- Around line 43-85: CI fails cargo fmt on the forwarding impls for ScanBuilder
on IndexedBatchBuilder and ColumnarBatchBuilder; run rustfmt (e.g., cargo fmt)
and commit the formatted changes so the impl blocks and their methods
(begin_batch, begin_row, end_row, resolve_field, append_str_by_idx,
append_int_by_idx, append_float_by_idx, append_null_by_idx, append_raw, and
set_buffer) match rustfmt's style and satisfy cargo fmt --check.
- Around line 303-309: The dictionary tier selection currently uses
StringDictionaryBuilder::<Int8Type> for cardinality < 256 which will overflow at
128..255 (i8::MAX); update the builder choice so that when you intend to support
up to 255 entries you use StringDictionaryBuilder::<UInt8Type> instead of
Int8Type, or alternatively change the branch threshold from "< 256" to "< 128"
to remain safe with Int8Type; specifically locate the branch that picks
StringDictionaryBuilder::<Int8Type> (and any constants/variables comparing
cardinality against 256) and replace Int8Type with UInt8Type (or lower the
numeric threshold) so the chosen dictionary type matches the supported
cardinality.

---

Outside diff comments:
In `@crates/logfwd-core/src/scanner.rs`:
- Around line 36-42: The change to ScanConfig::default altered the public
default schema by setting keep_raw: false; revert keep_raw to its previous value
(set keep_raw: true) in the ScanConfig::default implementation to preserve the
existing public behavior (ScanConfig::default, wanted_fields, extract_all remain
unchanged), or if the schema change is intentional, update every default-based
caller to construct ScanConfig explicitly with the new keep_raw value and add a
documented, top-level breaking-change note about the schema change; ensure you
modify ScanConfig::default (the impl Default for ScanConfig) and either revert
keep_raw or update and document callers accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 77398bfd-ac9d-497b-9dba-0cf66402ea8d

📥 Commits

Reviewing files that changed from the base of the PR and between d5050e7 and e9184f5.

📒 Files selected for processing (7)

crates/logfwd-core/fuzz/fuzz_targets/scanner.rs
crates/logfwd-core/src/batch_builder.rs
crates/logfwd-core/src/columnar_builder.rs
crates/logfwd-core/src/indexed_builder.rs
crates/logfwd-core/src/scanner.rs
crates/logfwd-core/src/simd_scanner.rs
crates/logfwd-core/tests/scanner_conformance.proptest-regressions

coderabbitai · 2026-03-29T17:08:52Z

+pub struct SimdScanner {
+    builder: IndexedBatchBuilder,
+    config: ScanConfig,
+}
+
+impl SimdScanner {
+    pub fn new(config: ScanConfig, expected_rows: usize) -> Self {
+        SimdScanner {
+            builder: IndexedBatchBuilder::new(expected_rows, config.keep_raw),
+            config,
+        }
+    }
+
+    /// Scan a buffer of newline-delimited JSON lines and return a RecordBatch.
+    pub fn scan(&mut self, buf: &[u8]) -> RecordBatch {
+        scan_into(buf, &self.config, &mut self.builder);
+        self.builder.finish_batch()
+    }


⚠️ Potential issue | 🔴 Critical

SimdScanner now panics on batches with 65+ extracted fields.

This switches the default SIMD backend to IndexedBatchBuilder, but that builder only tracks per-row writes in a u64. After the 64th discovered field, missing-row padding stops covering later columns, so the next row that omits one of them leaves the array short and finish_batch() panics. Please add an explicit guard or a fallback backend before wiring this in as the default.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-core/src/simd_scanner.rs` around lines 279 - 296, SimdScanner currently uses IndexedBatchBuilder which tracks per-row writes in a u64 and will break for >64 extracted fields; add a guard in SimdScanner::new (or right before calling scan_into in SimdScanner::scan) that checks the expected or detected max field count and if it can exceed 64, use a fallback builder (e.g., the previous non-indexed batch builder) or a safe implementation that supports >64 columns instead of IndexedBatchBuilder; reference the SimdScanner::new, SimdScanner::scan, IndexedBatchBuilder, and finish_batch() symbols so the change replaces the builder selection logic and prevents finish_batch() panics.

Two correctness fixes: 1. chunk_classify scalar fallback: Fix bit/buffer desync in escape handling. The old code advanced `i` by 2 on backslash but `bit` by 1, permanently misaligning all subsequent bitmask positions. Also, closing quotes were incorrectly marked as string interior. Replaced with escaped_next flag that properly carries across 64-byte block boundaries. 2. IndexedBatchBuilder + ColumnarBatchBuilder: Fix >64 field support. The u64 bitset only tracked fields 0-63. Fields 64+ had no duplicate detection (all shared bit 0) and no end-of-row NULL padding, causing column length mismatches. Added per-field written_this_row flag for overflow fields while preserving the fast bitset path for fields 0-63. Both fixes were found by the existing proptest oracle_multi_line test which generates random escape-heavy multi-line NDJSON with 64+ unique fields. https://claude.ai/code/session_01QamNN5H6VX3GE34Y28hqXJ

…ep_raw Address review findings: - Fuzz harness: add UTF-8 validation gate before invoking scanners. All builders use from_utf8_unchecked on extracted values, so non-UTF-8 fuzz input would cause UB. Return early if input fails from_utf8. - ColumnarBatchBuilder: fix Int8 dictionary overflow. The adaptive encoding used Int8Type for cardinality < 256, but i8 max index is 127. Changed threshold to < 128 to stay within i8 range. - Run cargo fmt on ScanBuilder forwarding impls. - Revert ScanConfig::default().keep_raw to true to preserve the existing public API contract. Verified against current code: - >64 field guard: already fixed in prior commit (written_this_row flag) - All 204 tests pass, 0 warnings, cargo fmt clean https://claude.ai/code/session_01QamNN5H6VX3GE34Y28hqXJ

coderabbitai

♻️ Duplicate comments (1)

crates/logfwd-core/src/simd_scanner.rs (1)

303-333: ⚠️ Potential issue | 🔴 Critical

ColumnarSimdScanner still exposes the low-cardinality dictionary overflow.

This new path is now public, but crates/logfwd-core/src/columnar_builder.rs still chooses StringDictionaryBuilder<Int8Type> for cardinality < 256. Int8Type tops out at 127 dictionary keys, so 128-255 unique strings will panic once this scanner hits that tier. Use UInt8Type for the first tier or lower the threshold below 128.

🛠️ Proposed fix

- use arrow::datatypes::{DataType, Field, Int8Type, Int16Type, Schema};
+ use arrow::datatypes::{DataType, Field, Int16Type, Schema, UInt8Type};

- /// adaptive dictionary encoding (Dict<Int8>/Dict<Int16>/plain StringArray
+ /// adaptive dictionary encoding (Dict<UInt8>/Dict<Int16>/plain StringArray

- if cardinality < 256 {
-     let mut builder = StringDictionaryBuilder::<Int8Type>::new();
+ if cardinality < 256 {
+     let mut builder = StringDictionaryBuilder::<UInt8Type>::new();
      let mut vi = 0;
      for row in 0..num_rows {
          if vi < str_values.len() && str_values[vi].row as usize == row {
              let vr = str_values[vi];
              let bytes =
                  &buf[vr.offset as usize..(vr.offset as usize + vr.len as usize)];
              let s = unsafe { std::str::from_utf8_unchecked(bytes) };
              builder.append_value(s);
              vi += 1;
          } else {
              builder.append_null();
          }
      }
      let dict_type =
-         DataType::Dictionary(Box::new(DataType::Int8), Box::new(DataType::Utf8));
+         DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
      schema_fields.push(Field::new(col_name, dict_type, true));
      arrays.push(Arc::new(builder.finish()) as ArrayRef);

For the Apache Arrow Rust crate version used by this repository, what is the maximum dictionary cardinality supported by `StringDictionaryBuilder<Int8Type>`? If the code wants a 256-entry dictionary tier, should it use `UInt8Type` instead?

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c345a27d-f5d9-494b-a587-4548349fe3d7

📥 Commits

Reviewing files that changed from the base of the PR and between e9184f5 and cf30952.

📒 Files selected for processing (4)

crates/logfwd-core/src/chunk_classify.rs
crates/logfwd-core/src/columnar_builder.rs
crates/logfwd-core/src/indexed_builder.rs
crates/logfwd-core/src/simd_scanner.rs

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/logfwd-core/src/columnar_builder.rs (1)
269-271: 🧹 Nitpick | 🔵 Trivial

Safety invariant depends on caller providing valid UTF-8.

The from_utf8_unchecked calls throughout finish_batch() (lines 270, 342, 361, 381, 399) assume the input buffer contains valid UTF-8. This is enforced by the fuzz harness, but production callers must also guarantee this. Consider adding a doc comment on finish_batch noting this requirement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/src/columnar_builder.rs` around lines 269 - 271,
finish_batch currently uses unsafe std::str::from_utf8_unchecked on field
buffers (e.g., the uses with fc.name) and thus requires callers to supply valid
UTF-8; add a doc comment on the finish_batch function stating this safety
invariant (that all field/name/value buffers must be valid UTF-8) so production
callers are aware and the unsafety contract is documented, and reference the
places using from_utf8_unchecked (e.g., fc.name and the other spots in
finish_batch) in the doc comment to make the requirement explicit.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-core/fuzz/fuzz_targets/scanner.rs`:
- Around line 81-91: Add coverage for the compressed IPC path by calling
ColumnarSimdScanner::scan_compressed(...) in the same two cases; construct or
reuse the existing compressed input (IPC + zstd) used elsewhere in tests or
compress `data` into the Arrow IPC + zstd form expected by scan_compressed, then
invoke ColumnarSimdScanner::new(make_extract_all_config(),
128).scan_compressed(compressed_data) and
ColumnarSimdScanner::new(make_pushdown_config(),
128).scan_compressed(compressed_data) and pass each result to
validate_batch(&batch, "columnar_extract_all_compressed") /
validate_batch(&batch, "columnar_pushdown_compressed") so that scan_compressed
is exercised alongside scan.

---

Outside diff comments:
In `@crates/logfwd-core/src/columnar_builder.rs`:
- Around line 269-271: finish_batch currently uses unsafe
std::str::from_utf8_unchecked on field buffers (e.g., the uses with fc.name) and
thus requires callers to supply valid UTF-8; add a doc comment on the
finish_batch function stating this safety invariant (that all field/name/value
buffers must be valid UTF-8) so production callers are aware and the unsafety
contract is documented, and reference the places using from_utf8_unchecked
(e.g., fc.name and the other spots in finish_batch) in the doc comment to make
the requirement explicit.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a735bd97-1e5a-4f6d-946c-e65447fed8d2

📥 Commits

Reviewing files that changed from the base of the PR and between cf30952 and 3fba7ab.

📒 Files selected for processing (3)

crates/logfwd-core/fuzz/fuzz_targets/scanner.rs
crates/logfwd-core/src/columnar_builder.rs
crates/logfwd-core/src/simd_scanner.rs

coderabbitai · 2026-03-29T17:27:06Z

+    // --- Columnar SIMD Scanner ---
+    {
+        let mut scanner = ColumnarSimdScanner::new(make_extract_all_config(), 128);
+        let batch = scanner.scan(data);
+        validate_batch(&batch, "columnar_extract_all");
+    }
+    {
+        let mut scanner = ColumnarSimdScanner::new(make_pushdown_config(), 128);
+        let batch = scanner.scan(data);
+        validate_batch(&batch, "columnar_pushdown");
+    }


🧹 Nitpick | 🔵 Trivial

Consider fuzzing scan_compressed() for completeness.

The ColumnarSimdScanner::scan_compressed() path is not exercised here. While the core scanning logic is shared with scan(), the IPC compression path (zstd + Arrow IPC serialization) could have edge cases worth fuzzing.

💡 Optional: Add scan_compressed coverage

// --- Columnar SIMD Scanner --- { let mut scanner = ColumnarSimdScanner::new(make_extract_all_config(), 128); let batch = scanner.scan(data); validate_batch(&batch, "columnar_extract_all"); } { let mut scanner = ColumnarSimdScanner::new(make_pushdown_config(), 128); let batch = scanner.scan(data); validate_batch(&batch, "columnar_pushdown"); } + // Compressed IPC path + { + let mut scanner = ColumnarSimdScanner::new(make_extract_all_config(), 128); + let _ = scanner.scan_compressed(data); // validates IPC serialization doesn't panic + } });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-core/fuzz/fuzz_targets/scanner.rs` around lines 81 - 91, Add coverage for the compressed IPC path by calling ColumnarSimdScanner::scan_compressed(...) in the same two cases; construct or reuse the existing compressed input (IPC + zstd) used elsewhere in tests or compress `data` into the Arrow IPC + zstd form expected by scan_compressed, then invoke ColumnarSimdScanner::new(make_extract_all_config(), 128).scan_compressed(compressed_data) and ColumnarSimdScanner::new(make_pushdown_config(), 128).scan_compressed(compressed_data) and pass each result to validate_batch(&batch, "columnar_extract_all_compressed") / validate_batch(&batch, "columnar_pushdown_compressed") so that scan_compressed is exercised alongside scan.

strawgate · 2026-03-29T19:25:36Z

Superseded by #62 which combines the ScanBuilder trait wiring with the StreamingBuilder + StorageBuilder rewrite. The deferred (row, value) pattern in #62 fixes the column alignment bug that existed in the incremental null-padding approach.

claude added 2 commits March 29, 2026 16:53

chore: add proptest regression file for pre-existing oracle_multi_lin…

e9184f5

…e failure https://claude.ai/code/session_01QamNN5H6VX3GE34Y28hqXJ

coderabbitai Bot requested changes Mar 29, 2026

View reviewed changes

claude added 2 commits March 29, 2026 17:11

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

coderabbitai Bot requested changes Mar 29, 2026

View reviewed changes

strawgate closed this Mar 29, 2026

github-actions Bot mentioned this pull request Apr 5, 2026

bench: add field pushdown scaling curve and compression level matrix #1178

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor SIMD scanner to support multiple builder backends#59

Refactor SIMD scanner to support multiple builder backends#59
strawgate wants to merge 4 commits into
masterfrom
claude/logfwd-post-simd-improvements-iq140

strawgate commented Mar 29, 2026

Uh oh!

coderabbitai Bot commented Mar 29, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Mar 29, 2026

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Mar 29, 2026

Uh oh!

strawgate commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

strawgate commented Mar 29, 2026

Summary

Key Changes

Implementation Details

Uh oh!

coderabbitai Bot commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

strawgate commented Mar 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Mar 29, 2026 •

edited

Loading