refactor: delete StorageBuilder, unify on Scanner with scan()/scan_detached() by strawgate · Pull Request #962 · strawgate/fastforward

strawgate · 2026-04-04T21:36:20Z

Summary

Delete StorageBuilder and CopyScanner entirely (-1041 net lines)
Rename ZeroCopyScanner → Scanner (only one scanner now, qualifier is noise)
Rename scan_owned() → scan_detached(), finish_batch_owned() → finish_batch_detached() (consistent with detach module vocabulary)
Migrate all 45 files: tests, benchmarks, fuzz targets, examples, docs

Follows #941, which showed that finish_batch_detached() dominates StorageBuilder in every benchmark.

Architecture after this PR

pub struct Scanner { /* StreamingBuilder inside */ }

impl Scanner {
    pub fn scan(&mut self, buf: Bytes) -> RecordBatch       // StringViewArray, zero-copy
    pub fn scan_detached(&mut self, buf: Bytes) -> RecordBatch  // StringArray, self-contained
}

One builder (StreamingBuilder), two finish modes:

finish_batch() → StringViewArray — views into input buffer (hot/wire path)
finish_batch_detached() → StringArray — bulk copy at finalization (persistence path)
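
To make the two finish modes concrete, here is a minimal sketch that uses only the standard library. It is not the real Scanner or StreamingBuilder: `ToyBuilder`, `ViewColumn`, `DetachedColumn`, `scan_lines`, `finish_views`, and `finish_detached` are hypothetical names, and `Arc<[u8]>` stands in for the `Bytes` input buffer. It only illustrates the idea of one scan pass followed by either handing out views into the input or bulk-copying at finalization.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a zero-copy string column: one shared buffer
/// plus byte ranges pointing into it.
pub struct ViewColumn {
    pub buffer: Arc<[u8]>,
    pub views: Vec<Range<usize>>,
}

impl ViewColumn {
    /// Read value `i` directly out of the shared input buffer.
    pub fn get(&self, i: usize) -> &str {
        std::str::from_utf8(&self.buffer[self.views[i].clone()]).expect("valid UTF-8")
    }
}

/// Hypothetical stand-in for a detached column: values are copied once,
/// so the column no longer references the input buffer.
pub struct DetachedColumn {
    pub values: Vec<String>,
}

/// Toy builder: records where each newline-delimited line lives in the input.
pub struct ToyBuilder {
    buffer: Arc<[u8]>,
    views: Vec<Range<usize>>,
}

impl ToyBuilder {
    pub fn scan_lines(buffer: Arc<[u8]>) -> Self {
        let mut views = Vec::new();
        let mut start = 0;
        for (i, b) in buffer.iter().enumerate() {
            if *b == b'\n' {
                if i > start {
                    views.push(start..i);
                }
                start = i + 1;
            }
        }
        if start < buffer.len() {
            views.push(start..buffer.len());
        }
        ToyBuilder { buffer, views }
    }

    /// View mode: hand out the ranges and keep the input buffer alive.
    pub fn finish_views(self) -> ViewColumn {
        ViewColumn { buffer: self.buffer, views: self.views }
    }

    /// Detached mode: bulk-copy every value at finalization.
    pub fn finish_detached(self) -> DetachedColumn {
        let values: Vec<String> = self
            .views
            .iter()
            .map(|r| String::from_utf8_lossy(&self.buffer[r.clone()]).into_owned())
            .collect();
        DetachedColumn { values }
    }
}
```

In this toy model the view column keeps the whole input allocation alive for as long as it exists, while the detached column can outlive the input. That is the trade-off the hot/wire path and the persistence path are described as making.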

Test plan

cargo build --workspace clean
cargo test --workspace — 1,002 tests pass
cargo clippy --workspace -- -D warnings clean
cargo fmt --all --check clean
Zero StorageBuilder/CopyScanner/ZeroCopyScanner/scan_owned/finish_batch_owned references in .rs or .md
CI green

🤖 Generated with Claude Code

Note

Delete `StorageBuilder`, `CopyScanner`, and `ZeroCopyScanner` in favor of a unified `Scanner` with `scan()`/`scan_detached()`

Removes StorageBuilder and the two separate scanner types (CopyScanner, ZeroCopyScanner) from logfwd-arrow, replacing them with a single Scanner that exposes two methods: scan() for zero-copy StringViewArray output and scan_detached() for owned StringArray output.
Renames finish_batch_owned to finish_batch_detached on StreamingBuilder to align with the new naming convention; no functional logic changes.
Updates all call sites across benchmarks, fuzz targets, tests, examples, and production code to use Scanner::new() with the appropriate method.
Risk: StorageBuilder, CopyScanner, and ZeroCopyScanner are no longer exported from logfwd-arrow; any code referencing these symbols will not compile.

^{Macroscope summarized f500df7.}

coderabbitai · 2026-04-04T21:36:34Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR consolidates dual-scanner infrastructure (CopyScanner backed by StorageBuilder and ZeroCopyScanner backed by StreamingBuilder) into a single Scanner type backed by StreamingBuilder. The StorageBuilder module is removed entirely. Two scan modes are now exposed through Scanner: scan(bytes::Bytes) produces zero-copy StringViewArray columns backed by the input buffer, while scan_detached(bytes::Bytes) produces self-contained StringArray columns with owned data. StreamingBuilder::finish_batch_owned() is renamed to finish_batch_detached(). Documentation is updated to describe scan modes rather than builder types. All benchmarks, tests, examples, and implementations are migrated to use the unified Scanner API.

Possibly related PRs

feat: dual-output StreamingBuilder — finish_batch_owned() for persistence path #941: Implements the dual-output StreamingBuilder pattern with finish_batch_detached() and scan_detached() methods that directly parallel this PR's approach to owned vs zero-copy finalization.
feat: suffix column names only on type conflict, delete dead rewriter (#445) #684: Modifies the same streaming_builder.rs and storage_builder.rs files and adjusts finish_batch logic, with overlapping changes to builder column-emission behavior.
feat: create logfwd-arrow crate, move builders + scanner structs (Step 1) #307: Relocates and reshapes Scanner, StorageBuilder, and StreamingBuilder types across the arrow-facing crate modules, directly overlapping with this consolidation refactor.

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 errors, 2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| High-Quality Rust Practices | ❌ Error | PR violates guideline `#2` by including multiple .clone()/allocation calls in hot-path benchmark loops without required justification comments. | Move per-iteration allocations outside benchmark loops or add explicit // PERF: allocation justified because [reason] comments at each flagged location. |
| Crate Boundary And Dependency Integrity | ❌ Error | PR violates workspace dependency management by adding bytes=1 to 6 individual crates without centralizing in workspace.dependencies | Move bytes={version=1} to workspace.dependencies in root Cargo.toml, update crates to use bytes={workspace=true}, and justify logfwd-transform production dependency usage. |
| Documentation Thoroughly Updated | ⚠️ Warning | VERIFICATION.md contains stale reference to deleted storage_builder.rs module in proof requirements table. | Remove the logfwd-arrow/storage_builder.rs row from VERIFICATION.md module table as it has no verification coverage. |
| Maintainer Fitness | ⚠️ Warning | PR fails Maintainer Fitness criteria: red CI blocking merge, hot-path benchmarks contain per-iteration allocations violating guidelines, integration test uses non-production path missing zero-copy regressions, documentation gaps on API semantics and historical accuracy. | Fix red CI job. Remove per-iteration allocations from scanner.rs, pipeline.rs, es_throughput.rs benchmarks. Update integration.rs to test production zero-copy path. Restore Phase 10b history accuracy. Clarify scanner.md distinguishing scan() vs scan_detached(). Confirm with benchmark results. |
| Formal Verification Coverage | ❓ Inconclusive | PR involves primarily function renames (ZeroCopyScanner→Scanner, finish_batch_owned→finish_batch_detached, scan_owned→scan_detached) rather than new public functions. VERIFICATION.md file location could not be confirmed, and existing Kani proof status for renamed functions is unclear. | Verify VERIFICATION.md exists and was updated; confirm Kani proofs exist for renamed functions; ensure Proptest covers Scanner dual-mode behavior across boundary conditions; document exemption rationale if applicable. |

macroscopeapp · 2026-04-04T21:38:49Z

Approvability

Verdict: Needs human review

This is a major refactor that deletes the entire StorageBuilder implementation (~1000 lines) and consolidates two scanner types (CopyScanner, ZeroCopyScanner) into a unified Scanner with scan()/scan_detached() modes. While the changes are mechanically straightforward, the API surface change and code path consolidation warrant human review to verify the functionality is properly preserved.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs (1)
37-50: ⚠️ Potential issue | 🟠 Major

This still lets schema and conflict-column drift slip through.

Comparing BTreeSets throws away column order, and the _ => {} arm silently accepts StructArray conflict columns plus any other unexpected type pair. That means scan() and scan_owned() can diverge on mixed-type fields or schema layout while this target still passes. Compare ordered fields, recurse into DataType::Struct(_) children, and treat any unhandled type pairing as a fuzz failure.

Based on learnings: Applies to crates/logfwd-arrow/src/**/*.rs : Use bare names for single-type fields and StructArray for column naming conflicts.

Also applies to: 53-114
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs` around lines 37
- 50, The current equality check uses BTreeSet of column names which loses
ordering and misses StructArray vs bare-type mismatches; replace the set
comparison with an ordered, index-by-index comparison of
owned_batch.schema().fields() vs streaming_batch.schema().fields(), and when a
field's DataType is a Struct recurse into its children to compare nested field
names/types; treat any unexpected or unhandled type pairing (e.g., one side
StructArray and the other a concrete type) as a fuzz failure (panic/assert) so
scan() and scan_owned() divergences fail the target; apply the same
ordered/composite checks for the analogous block covering lines ~53-114 and
follow the convention of using bare names for single-type fields and StructArray
for conflict columns.
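
The ordered, recursive comparison described above could look roughly like the following sketch. `ToyType`, `ToyField`, and `first_schema_mismatch` are hypothetical stand-ins rather than Arrow's `DataType`/`Field`. A real fuzz target would presumably also need to decide how the view and non-view string types of the two scan modes map onto each other, which this toy version ignores.

```rust
/// Hypothetical, simplified stand-ins for Arrow's schema types.
#[derive(Debug, Clone, PartialEq)]
pub enum ToyType {
    Int64,
    Utf8,
    Struct(Vec<ToyField>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct ToyField {
    pub name: String,
    pub ty: ToyType,
}

/// Compare two field lists position by position, recursing into structs.
/// Returns a description of the first mismatch instead of silently passing.
pub fn first_schema_mismatch(a: &[ToyField], b: &[ToyField]) -> Option<String> {
    if a.len() != b.len() {
        return Some(format!("field count {} != {}", a.len(), b.len()));
    }
    for (i, (fa, fb)) in a.iter().zip(b).enumerate() {
        // Order matters: field i on one side must be field i on the other.
        if fa.name != fb.name {
            return Some(format!("field {}: name {:?} != {:?}", i, fa.name, fb.name));
        }
        match (&fa.ty, &fb.ty) {
            // Conflict columns: compare the struct children the same way.
            (ToyType::Struct(ca), ToyType::Struct(cb)) => {
                if let Some(inner) = first_schema_mismatch(ca, cb) {
                    return Some(format!("{}.{}", fa.name, inner));
                }
            }
            (ta, tb) if ta == tb => {}
            // Any other pairing (for example struct vs bare type) is a failure.
            (ta, tb) => {
                return Some(format!("field {} ({}): {:?} != {:?}", i, fa.name, ta, tb));
            }
        }
    }
    None
}
```

Unlike a BTreeSet of names, this reports reordered columns and a struct-versus-bare-type divergence, and it has no catch-all arm that accepts unknown pairings.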

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-arrow/tests/allocation_regression.rs`:
- Around line 53-57: The test is measuring allocation regressions but currently
does heavy input preparation by calling bytes::Bytes::from(data.clone()) inside
the measurement windows; move the expensive allocation out of the Region::new()
measured blocks by creating a single bytes::Bytes value from data (e.g., let
input = bytes::Bytes::from(data.clone()) or construct input once without
including it in the timed region) before each measured region, and then inside
the measured loops call scanner.scan_owned(input.clone()).unwrap() so only the
cheap refcounted clone happens during measurement; update every measured block
that currently uses scanner.scan_owned(bytes::Bytes::from(data.clone())) to use
the pre-allocated input and clone that handle instead.

In `@crates/logfwd-bench/benches/pipeline.rs`:
- Line 14: The import has a duplicated "Streaming" identifier: replace the
incorrect use of StreamingStreamingSimdScanner with the correct type name
StreamingSimdScanner in the use statement and update any other references to
StreamingStreamingSimdScanner in this file (e.g., variable types, instantiations
or pattern matches) to StreamingSimdScanner so the code compiles.

In `@crates/logfwd-bench/src/rss.rs`:
- Around line 95-97: The RSS benchmark is biased because the owned path clones
the full Vec<u8> while the streaming path moves it; instead, convert the input
Vec<u8> into a Bytes once and then clone that Bytes handle (cheap refcount
increment) for both the scan_owned and scan paths so both measure the same input
cost; update the setup used by scan_owned and scan (referencing the scan_owned
and scan calls and the variables creating the input Bytes) to create bytes_input
= Bytes::from(vec) once and use bytes_input.clone() where the owned path
previously cloned the Vec<u8>.

In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs`:
- Around line 22-26: The fuzz target currently returns early when either
StreamingSimdScanner::scan_owned or ::scan returns Err, but doesn't check that
the other call produced the same Result; ensure parity by calling both (or
capturing both Results first) and asserting that both are Ok or both are Err
before proceeding—use the StreamingSimdScanner instances/variables
(owned_scanner, streaming_scanner) and their results (owned_batch,
streaming_batch) to compare outcomes and if one is Ok while the other is Err,
cause the target to fail (panic or unwrap) so the mismatch is reported.

In `@crates/logfwd-core/tests/it/compliance_data.rs`:
- Around line 725-726: Update the inline comment to reflect that
StreamingBuilder::begin_batch() resets field_index (it does not persist across
batches); change the wording near the test around field_index to state that
begin_batch clears field_index so the column still exists for schema stability
but its value should be null at batch start. Reference
StreamingBuilder::begin_batch() and the field_index behavior when editing the
comment.

In `@dev-docs/ARCHITECTURE.md`:
- Around line 162-175: Doc text incorrectly claims finish_batch() and
StreamingSimdScanner::scan(Bytes) are always zero-copy; update the wording to
qualify that finish_batch() is zero-copy only when no decoded strings exist and
that when decoded_buf is populated the implementation builds a combined buffer
(causing a copy) before creating the StringViewArray; change the description of
StreamingSimdScanner::scan(Bytes) to state it returns a zero-copy RecordBatch
only if no decoded strings are present, whereas
StreamingSimdScanner::scan_owned(Bytes) still produces an owned StringArray via
finish_batch_owned() for persistence/compression.
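
The ARCHITECTURE.md point above (zero-copy only holds when nothing had to be decoded) can be illustrated with a small standard-library sketch. This is a different, per-value technique (`Cow`) rather than the combined-buffer approach the comment attributes to the implementation, and `decode_json_str` is a hypothetical toy decoder that handles only a few escapes.

```rust
use std::borrow::Cow;

/// Return the string value, borrowing from the input when it contains no
/// escape sequences and allocating a decoded copy only when it does.
/// Toy decoder: understands `\n`; any other escaped character is passed
/// through unchanged (which covers `\"` and `\\`).
pub fn decode_json_str(raw: &str) -> Cow<'_, str> {
    if !raw.contains('\\') {
        // Nothing to decode: the result is just a view into the input.
        return Cow::Borrowed(raw);
    }
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some(other) => out.push(other),
            None => out.push('\\'), // trailing backslash: keep it as-is
        }
    }
    // Decoding changed the bytes, so the value can no longer alias the input.
    Cow::Owned(out)
}
```

The same reasoning applies at batch level: once any value needs decoded bytes, those bytes have to live somewhere other than the original input buffer, so an unconditional "zero-copy" claim in the docs would overstate the guarantee.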

---

Outside diff comments:
In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs`:
- Around line 37-50: The current equality check uses BTreeSet of column names
which loses ordering and misses StructArray vs bare-type mismatches; replace the
set comparison with an ordered, index-by-index comparison of
owned_batch.schema().fields() vs streaming_batch.schema().fields(), and when a
field's DataType is a Struct recurse into its children to compare nested field
names/types; treat any unexpected or unhandled type pairing (e.g., one side
StructArray and the other a concrete type) as a fuzz failure (panic/assert) so
scan() and scan_owned() divergences fail the target; apply the same
ordered/composite checks for the analogous block covering lines ~53-114 and
follow the convention of using bare names for single-type fields and StructArray
for conflict columns.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8424e577-003d-44e3-81bf-6cf987f8e129

📥 Commits

Reviewing files that changed from the base of the PR and between 91b628e and 2d41c02.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (37)

DEVELOPING.md
book/src/architecture/pipeline.md
book/src/development/contributing.md
crates/logfwd-arrow/README.md
crates/logfwd-arrow/src/conflict_schema.rs
crates/logfwd-arrow/src/lib.rs
crates/logfwd-arrow/src/scanner.rs
crates/logfwd-arrow/src/storage_builder.rs
crates/logfwd-arrow/tests/allocation_regression.rs
crates/logfwd-bench/benches/pipeline.rs
crates/logfwd-bench/src/e2e_profile.rs
crates/logfwd-bench/src/es_throughput.rs
crates/logfwd-bench/src/explore.rs
crates/logfwd-bench/src/rss.rs
crates/logfwd-bench/src/sizes.rs
crates/logfwd-core/benches/scanner.rs
crates/logfwd-core/examples/arrow_ipc_roundtrip.rs
crates/logfwd-core/fuzz/fuzz_targets/scanner.rs
crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs
crates/logfwd-core/fuzz/fuzz_targets/scanner_sink.rs
crates/logfwd-core/fuzz/fuzz_targets/scanner_transform.rs
crates/logfwd-core/fuzz/fuzz_targets/structural_index.rs
crates/logfwd-core/src/scan_config.rs
crates/logfwd-core/src/scanner.rs
crates/logfwd-core/tests/it/compliance_data.rs
crates/logfwd-core/tests/it/scanner_conformance.rs
crates/logfwd/Cargo.toml
crates/logfwd/src/pipeline.rs
crates/logfwd/tests/it/integration.rs
dev-docs/ARCHITECTURE.md
dev-docs/DESIGN.md
dev-docs/SCANNER_CONTRACT.md
dev-docs/VERIFICATION.md
dev-docs/research/arrow-ipc-feasibility.md
dev-docs/research/type-suffix-redesign.md
docs/ci/documentation-thoroughly-updated.md
docs/ci/high-quality-rust-practices.md

💤 Files with no reviewable changes (1)

crates/logfwd-arrow/src/storage_builder.rs

coderabbitai · 2026-04-04T21:50:24Z

+        drop(
+            scanner
+                .scan_owned(bytes::Bytes::from(data.clone()))
+                .unwrap(),
+        );


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify measured scan_owned loops are cloning Vec input in-place.
rg -n -C2 'scan_owned\(bytes::Bytes::from\([a-zA-Z0-9_]+\.clone\(\)\)\)' crates/logfwd-arrow/tests/allocation_regression.rs

Repository: strawgate/memagent

Length of output: 912

🏁 Script executed:

# Read context around the allocation test to confirm measurement setup
sed -n '40,85p' crates/logfwd-arrow/tests/allocation_regression.rs

Repository: strawgate/memagent

Length of output: 1168

🏁 Script executed:

# Also check the latter measurement sites
sed -n '120,145p' crates/logfwd-arrow/tests/allocation_regression.rs

Repository: strawgate/memagent

Length of output: 963

🏁 Script executed:

# Verify bytes::Bytes is indeed refcounted/cheap clone
rg -A5 'pub struct Bytes' --type rust

Repository: strawgate/memagent

Length of output: 44

Allocation regression measurements are polluted by input cloning inside measurement windows.

Lines 53–57, 62–66, 72–76, 129–131, 136–138: data.clone() allocates per iteration inside Region::new() measurement blocks. The test should measure scanner allocation behavior, not input preparation cost. Pre-allocate bytes::Bytes once before measured regions and clone the refcounted handle inside loops (cheap operation).

Proposed fix

 fn owned_scanner_no_leak_across_batches() {
     let mut scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let data = make_ndjson(500);
+    let input: bytes::Bytes = make_ndjson(500).into();
     for _ in 0..5 {
-        drop(
-            scanner
-                .scan_owned(bytes::Bytes::from(data.clone()))
-                .unwrap(),
-        );
+        drop(scanner.scan_owned(input.clone()).unwrap());
     }
     let reg1 = Region::new(GLOBAL);
     for _ in 0..10 {
-        drop(
-            scanner
-                .scan_owned(bytes::Bytes::from(data.clone()))
-                .unwrap(),
-        );
+        drop(scanner.scan_owned(input.clone()).unwrap());
     }
     let reg2 = Region::new(GLOBAL);
     for _ in 0..10 {
-        drop(
-            scanner
-                .scan_owned(bytes::Bytes::from(data.clone()))
-                .unwrap(),
-        );
+        drop(scanner.scan_owned(input.clone()).unwrap());
     }

-    let data_500 = make_ndjson(500);
+    let data_500: bytes::Bytes = make_ndjson(500).into();
     let reg_500 = Region::new(GLOBAL);
     let _ = scanner
-        .scan_owned(bytes::Bytes::from(data_500.clone()))
+        .scan_owned(data_500.clone())
         .unwrap();
-    let data_5000 = make_ndjson(5000);
+    let data_5000: bytes::Bytes = make_ndjson(5000).into();
     let reg_5000 = Region::new(GLOBAL);
     let _ = scanner
-        .scan_owned(bytes::Bytes::from(data_5000.clone()))
+        .scan_owned(data_5000.clone())
         .unwrap();

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-        drop(
-            scanner
-                .scan_owned(bytes::Bytes::from(data.clone()))
-                .unwrap(),
-        );
+        drop(scanner.scan_owned(input.clone()).unwrap());

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-arrow/tests/allocation_regression.rs` around lines 53 - 57, The test is measuring allocation regressions but currently does heavy input preparation by calling bytes::Bytes::from(data.clone()) inside the measurement windows; move the expensive allocation out of the Region::new() measured blocks by creating a single bytes::Bytes value from data (e.g., let input = bytes::Bytes::from(data.clone()) or construct input once without including it in the timed region) before each measured region, and then inside the measured loops call scanner.scan_owned(input.clone()).unwrap() so only the cheap refcounted clone happens during measurement; update every measured block that currently uses scanner.scan_owned(bytes::Bytes::from(data.clone())) to use the pre-allocated input and clone that handle instead.
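
As a rough illustration of why cloning a refcounted handle inside the measured region is cheap, the sketch below uses `Arc<[u8]>` from the standard library as a stand-in for `bytes::Bytes`. `prepare`, `consume`, and `measured_loop` are hypothetical names, and `consume` merely represents whatever call is being measured.

```rust
use std::sync::Arc;

/// Setup step, done once outside any measured region: move the payload
/// into a shared, reference-counted buffer.
pub fn prepare(data: Vec<u8>) -> Arc<[u8]> {
    Arc::from(data)
}

/// Hypothetical stand-in for the call under measurement; it takes the
/// buffer by value, as the scanner methods in this PR are shown to do.
fn consume(input: Arc<[u8]>) -> usize {
    input.len()
}

/// Measured loop: each iteration clones only the handle.
pub fn measured_loop(input: &Arc<[u8]>, iterations: usize) -> usize {
    let mut total = 0;
    for _ in 0..iterations {
        // Refcount increment only; the payload bytes are not copied.
        total += consume(Arc::clone(input));
    }
    total
}
```

Cloning the handle yields a second pointer to the same allocation, so the allocations observed inside the loop come from the code under test rather than from input preparation.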

coderabbitai · 2026-04-04T21:50:25Z


-use logfwd_arrow::scanner::SimdScanner;
+use bytes::Bytes;
+use logfwd_arrow::scanner::StreamingStreamingSimdScanner;


⚠️ Potential issue | 🔴 Critical

Typo: StreamingStreamingSimdScanner should be StreamingSimdScanner.

Duplicated "Streaming" prefix will cause compilation failure.

🐛 Fix

-use logfwd_arrow::scanner::StreamingStreamingSimdScanner;
+use logfwd_arrow::scanner::StreamingSimdScanner;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-use logfwd_arrow::scanner::StreamingStreamingSimdScanner;
+use logfwd_arrow::scanner::StreamingSimdScanner;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-bench/benches/pipeline.rs` at line 14, The import has a duplicated "Streaming" identifier: replace the incorrect use of StreamingStreamingSimdScanner with the correct type name StreamingSimdScanner in the use statement and update any other references to StreamingStreamingSimdScanner in this file (e.g., variable types, instantiations or pattern matches) to StreamingSimdScanner so the code compiles.

coderabbitai · 2026-04-04T21:50:25Z

+    // === scan_owned (copies strings) vs scan (zero-copy) ===
+    println!("--- scan_owned vs scan (1M simple lines) ---");
    {


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify scan_owned path clones full input while scan path does not in RSS comparison block.
rg -n -C3 'scan_owned\(bytes::Bytes::from\(data.clone\(\)\)\)|scan\(bytes::Bytes::from\(data\)\)' crates/logfwd-bench/src/rss.rs

Repository: strawgate/memagent

Length of output: 855

🏁 Script executed:

# Get full context of the benchmark block and function definitions
sed -n '90,130p' crates/logfwd-bench/src/rss.rs

Repository: strawgate/memagent

Length of output: 2004

🏁 Script executed:

# Find generate_simple function definition
rg -n 'fn generate_simple' crates/logfwd-bench/src/rss.rs -A 3

Repository: strawgate/memagent

Length of output: 319

scan_owned vs scan RSS benchmark comparison is biased by unequal input setup.

The owned path clones the full Vec<u8> before wrapping in Bytes, while the streaming path moves it directly. This inflates the owned-path RSS measurement by the clone cost, confounding the builder/output-mode comparison.

To fix, convert to Bytes once, then clone the cheap Bytes handle (refcount increment) for both paths:

-    let data = generate_simple(1_000_000);
+    let data: bytes::Bytes = generate_simple(1_000_000).into();
     let raw_mb = data.len() as f64 / 1_048_576.0;
     let before = rss_mb();
     let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let batch = owned_scanner.scan_owned(bytes::Bytes::from(data.clone())).unwrap();
+    let batch = owned_scanner.scan_owned(data.clone()).unwrap();
     let after_owned = rss_mb();
     ...
     let mid = rss_mb();
     let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let batch = streaming_scanner.scan(bytes::Bytes::from(data)).unwrap();
+    let batch = streaming_scanner.scan(data.clone()).unwrap();

Affects lines 103–104 and 112.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-bench/src/rss.rs` around lines 95 - 97, The RSS benchmark is biased because the owned path clones the full Vec<u8> while the streaming path moves it; instead, convert the input Vec<u8> into a Bytes once and then clone that Bytes handle (cheap refcount increment) for both the scan_owned and scan paths so both measure the same input cost; update the setup used by scan_owned and scan (referencing the scan_owned and scan calls and the variables creating the input Bytes) to create bytes_input = Bytes::from(vec) once and use bytes_input.clone() where the owned path previously cloned the Vec<u8>.

coderabbitai · 2026-04-04T21:50:25Z

+    let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
+    let Ok(owned_batch) = owned_scanner.scan_owned(bytes::Bytes::copy_from_slice(data)) else { return; };

    let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
    let Ok(streaming_batch) = streaming_scanner.scan(bytes::Bytes::copy_from_slice(data)) else { return; };


⚠️ Potential issue | 🟠 Major

Assert scan() / scan_owned() failure parity before returning.

A one-sided Err is currently ignored. Because the two modes diverge in finalization, the target should fail if one mode succeeds and the other does not.

🧪 Suggested fix

-    let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let Ok(owned_batch) = owned_scanner.scan_owned(bytes::Bytes::copy_from_slice(data)) else { return; };
-
-    let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let Ok(streaming_batch) = streaming_scanner.scan(bytes::Bytes::copy_from_slice(data)) else { return; };
+    let bytes = bytes::Bytes::copy_from_slice(data);
+
+    let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
+    let owned = owned_scanner.scan_owned(bytes.clone());
+
+    let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
+    let streaming = streaming_scanner.scan(bytes);
+
+    assert_eq!(owned.is_ok(), streaming.is_ok(), "mode success mismatch");
+    let (Ok(owned_batch), Ok(streaming_batch)) = (owned, streaming) else { return; };

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let Ok(owned_batch) = owned_scanner.scan_owned(bytes::Bytes::copy_from_slice(data)) else { return; };
-    let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let Ok(streaming_batch) = streaming_scanner.scan(bytes::Bytes::copy_from_slice(data)) else { return; };
+    let bytes = bytes::Bytes::copy_from_slice(data);
+    let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
+    let owned = owned_scanner.scan_owned(bytes.clone());
+    let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
+    let streaming = streaming_scanner.scan(bytes);
+    assert_eq!(owned.is_ok(), streaming.is_ok(), "mode success mismatch");
+    let (Ok(owned_batch), Ok(streaming_batch)) = (owned, streaming) else { return; };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs` around lines 22 - 26, The fuzz target currently returns early when either StreamingSimdScanner::scan_owned or ::scan returns Err, but doesn't check that the other call produced the same Result; ensure parity by calling both (or capturing both Results first) and asserting that both are Ok or both are Err before proceeding—use the StreamingSimdScanner instances/variables (owned_scanner, streaming_scanner) and their results (owned_batch, streaming_batch) to compare outcomes and if one is Ok while the other is Err, cause the target to fail (panic or unwrap) so the mismatch is reported.
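
The parity check from this comment can also be factored into a small generic helper, sketched here with the standard library only. `require_parity` is a hypothetical name, not something from the PR. It accepts any two `Result` values, so the two scan modes may return different batch or error types.

```rust
/// Returns both values when both results are `Ok`, `None` when both are
/// `Err`, and panics when exactly one side failed, so that a fuzzer
/// reports the divergence instead of silently skipping the input.
pub fn require_parity<A, B, EA, EB>(
    left: Result<A, EA>,
    right: Result<B, EB>,
) -> Option<(A, B)> {
    match (left, right) {
        (Ok(a), Ok(b)) => Some((a, b)),
        (Err(_), Err(_)) => None,
        (Ok(_), Err(_)) => panic!("mode mismatch: left succeeded, right failed"),
        (Err(_), Ok(_)) => panic!("mode mismatch: left failed, right succeeded"),
    }
}
```

In a fuzz target this would replace the two independent `let Ok(..) else { return; }` early exits: the target returns early only when both modes reject the input.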

coderabbitai · 2026-04-04T21:50:25Z

+    // (StreamingBuilder clears collectors on begin_batch, but field_index persists
    // for schema stability. The column exists but the value should be null.)


⚠️ Potential issue | 🟡 Minor

Inline behavior note is now inaccurate.

This comment says field_index persists, but StreamingBuilder::begin_batch() clears it. Please update the note to match current reset semantics.

Suggested wording

-    // (StreamingBuilder clears collectors on begin_batch, but field_index persists
-    // for schema stability. The column exists but the value should be null.)
+    // StreamingBuilder resets per-batch state on begin_batch().
+    // Prior batch values must not leak into this batch.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@crates/logfwd-core/tests/it/compliance_data.rs` around lines 725 - 726, Update the inline comment to reflect that StreamingBuilder::begin_batch() resets field_index (it does not persist across batches); change the wording near the test around field_index to state that begin_batch clears field_index so the column still exists for schema stability but its value should be null at batch start. Reference StreamingBuilder::begin_batch() and the field_index behavior when editing the comment.

…tached()

Delete StorageBuilder and CopyScanner; both are strictly dominated by StreamingBuilder's dual-output architecture (finish_batch for zero-copy StringViewArray, finish_batch_detached for owned StringArray).

Rename ZeroCopyScanner → Scanner. With only one scanner type, the "ZeroCopy" qualifier adds no information. The method names communicate the distinction: scan() for wire, scan_detached() for persistence.

Renames applied throughout:
ZeroCopyScanner → Scanner
CopyScanner → (deleted)
scan_owned() → scan_detached()
finish_batch_owned() → finish_batch_detached()

45 files changed, -1041 net lines. All 1,002 workspace tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

strawgate mentioned this pull request Apr 4, 2026

Rename SimdScanner → CopyScanner and StreamingSimdScanner → ZeroCopyScanner #929

Closed

strawgate mentioned this pull request Apr 4, 2026

test: 4 Kani proofs in logfwd-arrow test Vec/HashSet, not builder logic #923

Closed

coderabbitai Bot requested changes Apr 4, 2026

strawgate force-pushed the refactor/delete-storage-builder branch from 2d41c02 to f500df7 Compare April 4, 2026 22:15

strawgate changed the title ~~refactor: delete StorageBuilder — unify on StreamingBuilder dual-output~~ refactor: delete StorageBuilder, unify on Scanner with scan()/scan_detached() Apr 4, 2026

strawgate merged commit ec6f545 into master Apr 4, 2026
12 of 14 checks passed

strawgate deleted the refactor/delete-storage-builder branch April 4, 2026 22:25

This was referenced Apr 4, 2026

Add adaptive dictionary encoding to StorageBuilder #70

Closed

Bug prevention: dictionary compaction after Arrow filter/redaction #235

Closed

This was referenced Apr 6, 2026

Fix scanner boolean coercion, dot attributes, and add missing fuzz targets #1316

Closed

fan-in: integrate module refactor wave + crate-org research outputs #1708

Merged
