refactor: delete StorageBuilder, unify on Scanner with scan()/scan_detached()#962

Merged
strawgate merged 1 commit into master from refactor/delete-storage-builder on Apr 4, 2026

@strawgate
Owner

@strawgate strawgate commented Apr 4, 2026

Summary

  • Delete StorageBuilder and CopyScanner entirely (-1041 net lines)
  • Rename ZeroCopyScanner → Scanner (only one scanner now, qualifier is noise)
  • Rename scan_owned() → scan_detached(), finish_batch_owned() → finish_batch_detached() (consistent with detach module vocabulary)
  • Migrate all 45 files: tests, benchmarks, fuzz targets, examples, docs

Follows #941, which proved finish_batch_detached() dominates StorageBuilder in every benchmark.

Architecture after this PR

pub struct Scanner { /* StreamingBuilder inside */ }

impl Scanner {
    pub fn scan(&mut self, buf: Bytes) -> RecordBatch       // StringViewArray, zero-copy
    pub fn scan_detached(&mut self, buf: Bytes) -> RecordBatch  // StringArray, self-contained
}

One builder (StreamingBuilder), two finish modes:

  • finish_batch() → StringViewArray: views into input buffer (hot/wire path)
  • finish_batch_detached() → StringArray: bulk copy at finalization (persistence path)
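
The difference between the two finish modes can be pictured with a std-only sketch. Everything named here (`ViewColumn`, `DetachedColumn`, `finish_views`, `finish_detached`) is a hypothetical stand-in rather than the logfwd-arrow or Arrow API, and `Arc<[u8]>` stands in for the input buffer.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for a view-based column: it keeps the input
/// buffer alive and stores only byte ranges into it.
pub struct ViewColumn {
    buf: Arc<[u8]>,
    spans: Vec<Range<usize>>,
}

/// Hypothetical stand-in for a detached column: every value is copied
/// into storage that the column owns.
pub struct DetachedColumn {
    values: Vec<Vec<u8>>,
}

/// Analogue of the view-producing finish: no value bytes are copied.
pub fn finish_views(buf: &Arc<[u8]>, spans: &[Range<usize>]) -> ViewColumn {
    ViewColumn { buf: Arc::clone(buf), spans: spans.to_vec() }
}

/// Analogue of the detached finish: one bulk copy at finalization.
pub fn finish_detached(buf: &Arc<[u8]>, spans: &[Range<usize>]) -> DetachedColumn {
    DetachedColumn {
        values: spans.iter().map(|s| buf[s.clone()].to_vec()).collect(),
    }
}

impl ViewColumn {
    /// Borrow value `i` straight out of the shared input buffer.
    pub fn get(&self, i: usize) -> &[u8] {
        &self.buf[self.spans[i].clone()]
    }

    /// True while this column still shares the given input buffer.
    pub fn shares(&self, buf: &Arc<[u8]>) -> bool {
        Arc::ptr_eq(&self.buf, buf)
    }
}

impl DetachedColumn {
    /// Borrow value `i` from the column's own storage.
    pub fn get(&self, i: usize) -> &[u8] {
        &self.values[i]
    }
}
```

In this model the view column pins the whole input buffer for as long as it lives, while the detached column stays valid after the input is dropped. That matches the wire versus persistence split described above.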

Test plan

  • cargo build --workspace clean
  • cargo test --workspace — 1,002 tests pass
  • cargo clippy --workspace -- -D warnings clean
  • cargo fmt --all --check clean
  • Zero StorageBuilder/CopyScanner/ZeroCopyScanner/scan_owned/finish_batch_owned references in .rs or .md
  • CI green

🤖 Generated with Claude Code

Note

Delete StorageBuilder, CopyScanner, and ZeroCopyScanner in favor of a unified Scanner with scan()/scan_detached()

  • Removes StorageBuilder and the two separate scanner types (CopyScanner, ZeroCopyScanner) from logfwd-arrow, replacing them with a single Scanner that exposes two methods: scan() for zero-copy StringViewArray output and scan_detached() for owned StringArray output.
  • Renames finish_batch_owned to finish_batch_detached on StreamingBuilder to align with the new naming convention; no functional logic changes.
  • Updates all call sites across benchmarks, fuzz targets, tests, examples, and production code to use Scanner::new() with the appropriate method.
  • Risk: StorageBuilder, CopyScanner, and ZeroCopyScanner are no longer exported from logfwd-arrow; any code referencing these symbols will not compile.

Macroscope summarized f500df7.

@coderabbitai

coderabbitai Bot commented Apr 4, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR consolidates dual-scanner infrastructure (CopyScanner backed by StorageBuilder and ZeroCopyScanner backed by StreamingBuilder) into a single Scanner type backed by StreamingBuilder. The StorageBuilder module is removed entirely. Two scan modes are now exposed through Scanner: scan(bytes::Bytes) produces zero-copy StringViewArray columns backed by the input buffer, while scan_detached(bytes::Bytes) produces self-contained StringArray columns with owned data. StreamingBuilder::finish_batch_owned() is renamed to finish_batch_detached(). Documentation is updated to describe scan modes rather than builder types. All benchmarks, tests, examples, and implementations are migrated to use the unified Scanner API.

Possibly related PRs


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 errors, 2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| High-Quality Rust Practices | ❌ Error | PR violates guideline #2 by including multiple .clone()/allocation calls in hot-path benchmark loops without required justification comments. | Move per-iteration allocations outside benchmark loops or add explicit // PERF: allocation justified because [reason] comments at each flagged location. |
| Crate Boundary And Dependency Integrity | ❌ Error | PR violates workspace dependency management by adding bytes=1 to 6 individual crates without centralizing in workspace.dependencies. | Move bytes={version=1} to workspace.dependencies in root Cargo.toml, update crates to use bytes={workspace=true}, and justify logfwd-transform production dependency usage. |
| Documentation Thoroughly Updated | ⚠️ Warning | VERIFICATION.md contains stale reference to deleted storage_builder.rs module in proof requirements table. | Remove the logfwd-arrow/storage_builder.rs row from VERIFICATION.md module table as it has no verification coverage. |
| Maintainer Fitness | ⚠️ Warning | PR fails Maintainer Fitness criteria: red CI blocking merge, hot-path benchmarks contain per-iteration allocations violating guidelines, integration test uses non-production path missing zero-copy regressions, documentation gaps on API semantics and historical accuracy. | Fix red CI job. Remove per-iteration allocations from scanner.rs, pipeline.rs, es_throughput.rs benchmarks. Update integration.rs to test production zero-copy path. Restore Phase 10b history accuracy. Clarify scanner.md distinguishing scan() vs scan_detached(). Confirm with benchmark results. |
| Formal Verification Coverage | ❓ Inconclusive | PR involves primarily function renames (ZeroCopyScanner→Scanner, finish_batch_owned→finish_batch_detached, scan_owned→scan_detached) rather than new public functions. VERIFICATION.md file location could not be confirmed, and existing Kani proof status for renamed functions is unclear. | Verify VERIFICATION.md exists and was updated; confirm Kani proofs exist for renamed functions; ensure Proptest covers Scanner dual-mode behavior across boundary conditions; document exemption rationale if applicable. |

@macroscopeapp

macroscopeapp Bot commented Apr 4, 2026

Approvability

Verdict: Needs human review

This is a major refactor that deletes the entire StorageBuilder implementation (~1000 lines) and consolidates two scanner types (CopyScanner, ZeroCopyScanner) into a unified Scanner with scan()/scan_detached() modes. While the changes are mechanically straightforward, the API surface change and code path consolidation warrant human review to verify the functionality is properly preserved.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs (1)

37-50: ⚠️ Potential issue | 🟠 Major

This still lets schema and conflict-column drift slip through.

Comparing BTreeSets throws away column order, and the _ => {} arm silently accepts StructArray conflict columns plus any other unexpected type pair. That means scan() and scan_owned() can diverge on mixed-type fields or schema layout while this target still passes. Compare ordered fields, recurse into DataType::Struct(_) children, and treat any unhandled type pairing as a fuzz failure.

Based on learnings: Applies to crates/logfwd-arrow/src/**/*.rs : Use bare names for single-type fields and StructArray for column naming conflicts.

Also applies to: 53-114
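
An ordered, recursive comparison of the kind requested above might look like the sketch below. `Kind` and `Field` are hypothetical, simplified stand-ins for Arrow's `DataType` and `Field`, and `first_mismatch` is an illustrative helper, not code from this repository.

```rust
/// Hypothetical, simplified stand-ins for a schema field and its type.
#[derive(Debug, Clone, PartialEq)]
pub enum Kind {
    Int,
    Float,
    Text,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub kind: Kind,
}

/// Compare two field lists position by position, recursing into struct
/// children. Returns a description of the first mismatch, or `None`
/// when the two layouts agree.
pub fn first_mismatch(a: &[Field], b: &[Field], path: &str) -> Option<String> {
    if a.len() != b.len() {
        return Some(format!("{path}: {} fields vs {}", a.len(), b.len()));
    }
    for (fa, fb) in a.iter().zip(b) {
        let here = format!("{path}/{}", fa.name);
        if fa.name != fb.name {
            return Some(format!("{here}: name differs from {}", fb.name));
        }
        match (&fa.kind, &fb.kind) {
            (Kind::Struct(ca), Kind::Struct(cb)) => {
                if let Some(m) = first_mismatch(ca, cb, &here) {
                    return Some(m);
                }
            }
            // Leaf types must match exactly.
            (ka, kb) if ka == kb => {}
            // Any other pairing, including a struct on one side only,
            // is reported instead of being silently accepted.
            (ka, kb) => return Some(format!("{here}: {ka:?} vs {kb:?}")),
        }
    }
    None
}
```

A fuzz target built on this idea would assert that the helper returns `None`, so a reordered column or a struct-versus-bare-type divergence fails the run instead of passing through a catch-all arm.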

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs` around lines 37
- 50, The current equality check uses BTreeSet of column names which loses
ordering and misses StructArray vs bare-type mismatches; replace the set
comparison with an ordered, index-by-index comparison of
owned_batch.schema().fields() vs streaming_batch.schema().fields(), and when a
field's DataType is a Struct recurse into its children to compare nested field
names/types; treat any unexpected or unhandled type pairing (e.g., one side
StructArray and the other a concrete type) as a fuzz failure (panic/assert) so
scan() and scan_owned() divergences fail the target; apply the same
ordered/composite checks for the analogous block covering lines ~53-114 and
follow the convention of using bare names for single-type fields and StructArray
for conflict columns.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-arrow/tests/allocation_regression.rs`:
- Around line 53-57: The test is measuring allocation regressions but currently
does heavy input preparation by calling bytes::Bytes::from(data.clone()) inside
the measurement windows; move the expensive allocation out of the Region::new()
measured blocks by creating a single bytes::Bytes value from data (e.g., let
input = bytes::Bytes::from(data.clone()) or construct input once without
including it in the timed region) before each measured region, and then inside
the measured loops call scanner.scan_owned(input.clone()).unwrap() so only the
cheap refcounted clone happens during measurement; update every measured block
that currently uses scanner.scan_owned(bytes::Bytes::from(data.clone())) to use
the pre-allocated input and clone that handle instead.

In `@crates/logfwd-bench/benches/pipeline.rs`:
- Line 14: The import has a duplicated "Streaming" identifier: replace the
incorrect use of StreamingStreamingSimdScanner with the correct type name
StreamingSimdScanner in the use statement and update any other references to
StreamingStreamingSimdScanner in this file (e.g., variable types, instantiations
or pattern matches) to StreamingSimdScanner so the code compiles.

In `@crates/logfwd-bench/src/rss.rs`:
- Around line 95-97: The RSS benchmark is biased because the owned path clones
the full Vec<u8> while the streaming path moves it; instead, convert the input
Vec<u8> into a Bytes once and then clone that Bytes handle (cheap refcount
increment) for both the scan_owned and scan paths so both measure the same input
cost; update the setup used by scan_owned and scan (referencing the scan_owned
and scan calls and the variables creating the input Bytes) to create bytes_input
= Bytes::from(vec) once and use bytes_input.clone() where the owned path
previously cloned the Vec<u8>.

In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs`:
- Around line 22-26: The fuzz target currently returns early when either
StreamingSimdScanner::scan_owned or ::scan returns Err, but doesn't check that
the other call produced the same Result; ensure parity by calling both (or
capturing both Results first) and asserting that both are Ok or both are Err
before proceeding—use the StreamingSimdScanner instances/variables
(owned_scanner, streaming_scanner) and their results (owned_batch,
streaming_batch) to compare outcomes and if one is Ok while the other is Err,
cause the target to fail (panic or unwrap) so the mismatch is reported.

In `@crates/logfwd-core/tests/it/compliance_data.rs`:
- Around line 725-726: Update the inline comment to reflect that
StreamingBuilder::begin_batch() resets field_index (it does not persist across
batches); change the wording near the test around field_index to state that
begin_batch clears field_index so the column still exists for schema stability
but its value should be null at batch start. Reference
StreamingBuilder::begin_batch() and the field_index behavior when editing the
comment.

In `@dev-docs/ARCHITECTURE.md`:
- Around line 162-175: Doc text incorrectly claims finish_batch() and
StreamingSimdScanner::scan(Bytes) are always zero-copy; update the wording to
qualify that finish_batch() is zero-copy only when no decoded strings exist and
that when decoded_buf is populated the implementation builds a combined buffer
(causing a copy) before creating the StringViewArray; change the description of
StreamingSimdScanner::scan(Bytes) to state it returns a zero-copy RecordBatch
only if no decoded strings are present, whereas
StreamingSimdScanner::scan_owned(Bytes) still produces an owned StringArray via
finish_batch_owned() for persistence/compression.

---

Outside diff comments:
In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs`:
- Around line 37-50: The current equality check uses BTreeSet of column names
which loses ordering and misses StructArray vs bare-type mismatches; replace the
set comparison with an ordered, index-by-index comparison of
owned_batch.schema().fields() vs streaming_batch.schema().fields(), and when a
field's DataType is a Struct recurse into its children to compare nested field
names/types; treat any unexpected or unhandled type pairing (e.g., one side
StructArray and the other a concrete type) as a fuzz failure (panic/assert) so
scan() and scan_owned() divergences fail the target; apply the same
ordered/composite checks for the analogous block covering lines ~53-114 and
follow the convention of using bare names for single-type fields and StructArray
for conflict columns.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 8424e577-003d-44e3-81bf-6cf987f8e129

📥 Commits

Reviewing files that changed from the base of the PR and between 91b628e and 2d41c02.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (37)
  • DEVELOPING.md
  • book/src/architecture/pipeline.md
  • book/src/development/contributing.md
  • crates/logfwd-arrow/README.md
  • crates/logfwd-arrow/src/conflict_schema.rs
  • crates/logfwd-arrow/src/lib.rs
  • crates/logfwd-arrow/src/scanner.rs
  • crates/logfwd-arrow/src/storage_builder.rs
  • crates/logfwd-arrow/tests/allocation_regression.rs
  • crates/logfwd-bench/benches/pipeline.rs
  • crates/logfwd-bench/src/e2e_profile.rs
  • crates/logfwd-bench/src/es_throughput.rs
  • crates/logfwd-bench/src/explore.rs
  • crates/logfwd-bench/src/rss.rs
  • crates/logfwd-bench/src/sizes.rs
  • crates/logfwd-core/benches/scanner.rs
  • crates/logfwd-core/examples/arrow_ipc_roundtrip.rs
  • crates/logfwd-core/fuzz/fuzz_targets/scanner.rs
  • crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs
  • crates/logfwd-core/fuzz/fuzz_targets/scanner_sink.rs
  • crates/logfwd-core/fuzz/fuzz_targets/scanner_transform.rs
  • crates/logfwd-core/fuzz/fuzz_targets/structural_index.rs
  • crates/logfwd-core/src/scan_config.rs
  • crates/logfwd-core/src/scanner.rs
  • crates/logfwd-core/tests/it/compliance_data.rs
  • crates/logfwd-core/tests/it/scanner_conformance.rs
  • crates/logfwd/Cargo.toml
  • crates/logfwd/src/pipeline.rs
  • crates/logfwd/tests/it/integration.rs
  • dev-docs/ARCHITECTURE.md
  • dev-docs/DESIGN.md
  • dev-docs/SCANNER_CONTRACT.md
  • dev-docs/VERIFICATION.md
  • dev-docs/research/arrow-ipc-feasibility.md
  • dev-docs/research/type-suffix-redesign.md
  • docs/ci/documentation-thoroughly-updated.md
  • docs/ci/high-quality-rust-practices.md
💤 Files with no reviewable changes (1)
  • crates/logfwd-arrow/src/storage_builder.rs

Comment on lines +53 to +57
drop(
    scanner
        .scan_owned(bytes::Bytes::from(data.clone()))
        .unwrap(),
);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify measured scan_owned loops are cloning Vec input in-place.
rg -n -C2 'scan_owned\(bytes::Bytes::from\([a-zA-Z0-9_]+\.clone\(\)\)\)' crates/logfwd-arrow/tests/allocation_regression.rs

Repository: strawgate/memagent

Length of output: 912


🏁 Script executed:

# Read context around the allocation test to confirm measurement setup
sed -n '40,85p' crates/logfwd-arrow/tests/allocation_regression.rs

Repository: strawgate/memagent

Length of output: 1168


🏁 Script executed:

# Also check the latter measurement sites
sed -n '120,145p' crates/logfwd-arrow/tests/allocation_regression.rs

Repository: strawgate/memagent

Length of output: 963


🏁 Script executed:

# Verify bytes::Bytes is indeed refcounted/cheap clone
rg -A5 'pub struct Bytes' --type rust

Repository: strawgate/memagent

Length of output: 44


Allocation regression measurements are polluted by input cloning inside measurement windows.

Lines 53–57, 62–66, 72–76, 129–131, 136–138: data.clone() allocates per iteration inside Region::new() measurement blocks. The test should measure scanner allocation behavior, not input preparation cost. Pre-allocate bytes::Bytes once before measured regions and clone the refcounted handle inside loops (cheap operation).

Proposed fix
 fn owned_scanner_no_leak_across_batches() {
     let mut scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let data = make_ndjson(500);
+    let input: bytes::Bytes = make_ndjson(500).into();

     for _ in 0..5 {
-        drop(
-            scanner
-                .scan_owned(bytes::Bytes::from(data.clone()))
-                .unwrap(),
-        );
+        drop(scanner.scan_owned(input.clone()).unwrap());
     }

     let reg1 = Region::new(GLOBAL);
     for _ in 0..10 {
-        drop(
-            scanner
-                .scan_owned(bytes::Bytes::from(data.clone()))
-                .unwrap(),
-        );
+        drop(scanner.scan_owned(input.clone()).unwrap());
     }

     let reg2 = Region::new(GLOBAL);
     for _ in 0..10 {
-        drop(
-            scanner
-                .scan_owned(bytes::Bytes::from(data.clone()))
-                .unwrap(),
-        );
+        drop(scanner.scan_owned(input.clone()).unwrap());
     }
-    let data_500 = make_ndjson(500);
+    let data_500: bytes::Bytes = make_ndjson(500).into();
     let reg_500 = Region::new(GLOBAL);
     let _ = scanner
-        .scan_owned(bytes::Bytes::from(data_500.clone()))
+        .scan_owned(data_500.clone())
         .unwrap();

-    let data_5000 = make_ndjson(5000);
+    let data_5000: bytes::Bytes = make_ndjson(5000).into();
     let reg_5000 = Region::new(GLOBAL);
     let _ = scanner
-        .scan_owned(bytes::Bytes::from(data_5000.clone()))
+        .scan_owned(data_5000.clone())
         .unwrap();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-drop(
-    scanner
-        .scan_owned(bytes::Bytes::from(data.clone()))
-        .unwrap(),
-);
+drop(scanner.scan_owned(input.clone()).unwrap());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-arrow/tests/allocation_regression.rs` around lines 53 - 57, The
test is measuring allocation regressions but currently does heavy input
preparation by calling bytes::Bytes::from(data.clone()) inside the measurement
windows; move the expensive allocation out of the Region::new() measured blocks
by creating a single bytes::Bytes value from data (e.g., let input =
bytes::Bytes::from(data.clone()) or construct input once without including it in
the timed region) before each measured region, and then inside the measured
loops call scanner.scan_owned(input.clone()).unwrap() so only the cheap
refcounted clone happens during measurement; update every measured block that
currently uses scanner.scan_owned(bytes::Bytes::from(data.clone())) to use the
pre-allocated input and clone that handle instead.
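
The point about cloning a handle instead of the payload can be illustrated without the `bytes` crate. The sketch below uses `Arc<[u8]>` as a std-only analogue of a refcounted `bytes::Bytes`; `scan_stub`, `prepare_input`, and `measured_loop` are hypothetical names, and the stub does not model the real scanner.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the scanner call under measurement; it only
/// reads the input and reports its length.
pub fn scan_stub(input: Arc<[u8]>) -> usize {
    input.len()
}

/// Build the NDJSON-like input once, outside any measured region.
pub fn prepare_input(lines: usize) -> Arc<[u8]> {
    let mut data = Vec::new();
    for i in 0..lines {
        data.extend_from_slice(format!("{{\"n\":{i}}}\n").as_bytes());
    }
    Arc::from(data)
}

/// The measured loop: each iteration clones only the handle, which bumps
/// a reference count and leaves the payload bytes where they are.
pub fn measured_loop(input: &Arc<[u8]>, iterations: usize) -> usize {
    let mut total = 0;
    for _ in 0..iterations {
        total += scan_stub(Arc::clone(input));
    }
    total
}
```

With the input prepared up front, whatever an allocation counter records inside the loop comes from the code under test, not from copying a `Vec<u8>` on every iteration.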

Comment thread crates/logfwd-bench/benches/pipeline.rs Outdated

use logfwd_arrow::scanner::SimdScanner;
use bytes::Bytes;
use logfwd_arrow::scanner::StreamingStreamingSimdScanner;

⚠️ Potential issue | 🔴 Critical

Typo: StreamingStreamingSimdScanner should be StreamingSimdScanner.

Duplicated "Streaming" prefix will cause compilation failure.

🐛 Fix
-use logfwd_arrow::scanner::StreamingStreamingSimdScanner;
+use logfwd_arrow::scanner::StreamingSimdScanner;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-use logfwd_arrow::scanner::StreamingStreamingSimdScanner;
+use logfwd_arrow::scanner::StreamingSimdScanner;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-bench/benches/pipeline.rs` at line 14, The import has a
duplicated "Streaming" identifier: replace the incorrect use of
StreamingStreamingSimdScanner with the correct type name StreamingSimdScanner in
the use statement and update any other references to
StreamingStreamingSimdScanner in this file (e.g., variable types, instantiations
or pattern matches) to StreamingSimdScanner so the code compiles.

Comment thread crates/logfwd-bench/src/rss.rs Outdated
Comment on lines 95 to 97
// === scan_owned (copies strings) vs scan (zero-copy) ===
println!("--- scan_owned vs scan (1M simple lines) ---");
{

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify scan_owned path clones full input while scan path does not in RSS comparison block.
rg -n -C3 'scan_owned\(bytes::Bytes::from\(data.clone\(\)\)\)|scan\(bytes::Bytes::from\(data\)\)' crates/logfwd-bench/src/rss.rs

Repository: strawgate/memagent

Length of output: 855


🏁 Script executed:

# Get full context of the benchmark block and function definitions
sed -n '90,130p' crates/logfwd-bench/src/rss.rs

Repository: strawgate/memagent

Length of output: 2004


🏁 Script executed:

# Find generate_simple function definition
rg -n 'fn generate_simple' crates/logfwd-bench/src/rss.rs -A 3

Repository: strawgate/memagent

Length of output: 319


scan_owned vs scan RSS benchmark comparison is biased by unequal input setup.

The owned path clones the full Vec<u8> before wrapping in Bytes, while the streaming path moves it directly. This inflates the owned-path RSS measurement by the clone cost, confounding the builder/output-mode comparison.

To fix, convert to Bytes once, then clone the cheap Bytes handle (refcount increment) for both paths:

-    let data = generate_simple(1_000_000);
+    let data: bytes::Bytes = generate_simple(1_000_000).into();
     let raw_mb = data.len() as f64 / 1_048_576.0;

     let before = rss_mb();
     let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let batch = owned_scanner.scan_owned(bytes::Bytes::from(data.clone())).unwrap();
+    let batch = owned_scanner.scan_owned(data.clone()).unwrap();
     let after_owned = rss_mb();
     ...
     let mid = rss_mb();
     let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let batch = streaming_scanner.scan(bytes::Bytes::from(data)).unwrap();
+    let batch = streaming_scanner.scan(data.clone()).unwrap();

Affects lines 103–104 and 112.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-bench/src/rss.rs` around lines 95 - 97, The RSS benchmark is
biased because the owned path clones the full Vec<u8> while the streaming path
moves it; instead, convert the input Vec<u8> into a Bytes once and then clone
that Bytes handle (cheap refcount increment) for both the scan_owned and scan
paths so both measure the same input cost; update the setup used by scan_owned
and scan (referencing the scan_owned and scan calls and the variables creating
the input Bytes) to create bytes_input = Bytes::from(vec) once and use
bytes_input.clone() where the owned path previously cloned the Vec<u8>.

Comment on lines 22 to 26
let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
let Ok(owned_batch) = owned_scanner.scan_owned(bytes::Bytes::copy_from_slice(data)) else { return; };

let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
let Ok(streaming_batch) = streaming_scanner.scan(bytes::Bytes::copy_from_slice(data)) else { return; };

⚠️ Potential issue | 🟠 Major

Assert scan() / scan_owned() failure parity before returning.

A one-sided Err is currently ignored. Because the two modes diverge in finalization, the target should fail if one mode succeeds and the other does not.

🧪 Suggested fix
-    let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let Ok(owned_batch) = owned_scanner.scan_owned(bytes::Bytes::copy_from_slice(data)) else { return; };
-
-    let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
-    let Ok(streaming_batch) = streaming_scanner.scan(bytes::Bytes::copy_from_slice(data)) else { return; };
+    let bytes = bytes::Bytes::copy_from_slice(data);
+
+    let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
+    let owned = owned_scanner.scan_owned(bytes.clone());
+
+    let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
+    let streaming = streaming_scanner.scan(bytes);
+
+    assert_eq!(owned.is_ok(), streaming.is_ok(), "mode success mismatch");
+    let (Ok(owned_batch), Ok(streaming_batch)) = (owned, streaming) else { return; };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
-let Ok(owned_batch) = owned_scanner.scan_owned(bytes::Bytes::copy_from_slice(data)) else { return; };
-let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
-let Ok(streaming_batch) = streaming_scanner.scan(bytes::Bytes::copy_from_slice(data)) else { return; };
+let bytes = bytes::Bytes::copy_from_slice(data);
+let mut owned_scanner = StreamingSimdScanner::new(ScanConfig::default());
+let owned = owned_scanner.scan_owned(bytes.clone());
+let mut streaming_scanner = StreamingSimdScanner::new(ScanConfig::default());
+let streaming = streaming_scanner.scan(bytes);
+assert_eq!(owned.is_ok(), streaming.is_ok(), "mode success mismatch");
+let (Ok(owned_batch), Ok(streaming_batch)) = (owned, streaming) else { return; };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/fuzz/fuzz_targets/scanner_consistency.rs` around lines 22
- 26, The fuzz target currently returns early when either
StreamingSimdScanner::scan_owned or ::scan returns Err, but doesn't check that
the other call produced the same Result; ensure parity by calling both (or
capturing both Results first) and asserting that both are Ok or both are Err
before proceeding—use the StreamingSimdScanner instances/variables
(owned_scanner, streaming_scanner) and their results (owned_batch,
streaming_batch) to compare outcomes and if one is Ok while the other is Err,
cause the target to fail (panic or unwrap) so the mismatch is reported.
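
The parity rule can also be factored into a small generic helper. This is a sketch under the assumption that both modes return some `Result`; `parity` is a hypothetical name and is not part of the fuzz target as written.

```rust
use std::fmt::Debug;

/// Enforce "both succeed or both fail" on two results.
///
/// Returns `Ok(Some(..))` when both succeeded, `Ok(None)` when both
/// failed (nothing left to compare), and `Err(..)` describing the
/// one-sided failure otherwise.
pub fn parity<A, B, E1: Debug, E2: Debug>(
    first: Result<A, E1>,
    second: Result<B, E2>,
) -> Result<Option<(A, B)>, String> {
    match (first, second) {
        (Ok(a), Ok(b)) => Ok(Some((a, b))),
        (Err(_), Err(_)) => Ok(None),
        (Ok(_), Err(e)) => Err(format!("only the second mode failed: {e:?}")),
        (Err(e), Ok(_)) => Err(format!("only the first mode failed: {e:?}")),
    }
}
```

A fuzz target could unwrap the outer result so that a one-sided failure panics with the offending error, then return early on `None` and continue with the batch comparison on `Some`.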

Comment on lines +725 to 726
// (StreamingBuilder clears collectors on begin_batch, but field_index persists
// for schema stability. The column exists but the value should be null.)

⚠️ Potential issue | 🟡 Minor

Inline behavior note is now inaccurate.

This comment says field_index persists, but StreamingBuilder::begin_batch() clears it. Please update the note to match current reset semantics.

Suggested wording
-    // (StreamingBuilder clears collectors on begin_batch, but field_index persists
-    // for schema stability. The column exists but the value should be null.)
+    // StreamingBuilder resets per-batch state on begin_batch().
+    // Prior batch values must not leak into this batch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-core/tests/it/compliance_data.rs` around lines 725 - 726,
Update the inline comment to reflect that StreamingBuilder::begin_batch() resets
field_index (it does not persist across batches); change the wording near the
test around field_index to state that begin_batch clears field_index so the
column still exists for schema stability but its value should be null at batch
start. Reference StreamingBuilder::begin_batch() and the field_index behavior
when editing the comment.

Comment thread dev-docs/ARCHITECTURE.md Outdated
…tached()

Delete StorageBuilder and CopyScanner — both are strictly dominated by
StreamingBuilder's dual-output architecture (finish_batch for zero-copy
StringViewArray, finish_batch_detached for owned StringArray).

Rename ZeroCopyScanner → Scanner. With only one scanner type, the
"ZeroCopy" qualifier adds no information. The method names communicate
the distinction: scan() for wire, scan_detached() for persistence.

Renames applied throughout:
  ZeroCopyScanner     → Scanner
  CopyScanner         → (deleted)
  scan_owned()        → scan_detached()
  finish_batch_owned() → finish_batch_detached()

45 files changed, -1041 net lines. All 1,002 workspace tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@strawgate strawgate force-pushed the refactor/delete-storage-builder branch from 2d41c02 to f500df7 April 4, 2026 22:15
@strawgate strawgate changed the title from "refactor: delete StorageBuilder — unify on StreamingBuilder dual-output" to "refactor: delete StorageBuilder, unify on Scanner with scan()/scan_detached()" Apr 4, 2026
@strawgate strawgate merged commit ec6f545 into master Apr 4, 2026
12 of 14 checks passed
@strawgate strawgate deleted the refactor/delete-storage-builder branch April 4, 2026 22:25