Phase 1a: Create logfwd-arrow crate, move builders + SIMD #264

@strawgate

Summary

Create logfwd-arrow crate and move Arrow-dependent code out of logfwd-core.

No backward compatibility is needed because we have no external users. Just move the code and update all imports.

What moves

| File | What | Why |
| --- | --- | --- |
| streaming_builder.rs | StreamingBuilder (entire file) | Arrow StringViewArray, bytes::Bytes |
| storage_builder.rs | StorageBuilder (entire file) | Arrow arrays, HashMap |
| scanner.rs (structs only) | SimdScanner, StreamingSimdScanner | Return RecordBatch |
| chunk_classify.rs (SIMD only) | AVX2/SSE2/NEON platform impls | unsafe intrinsics |

What stays in logfwd-core

  • ScanBuilder trait (will become FieldSink in Phase 3)
  • scan_into, scan_line, skip_ws (generic scan loop)
  • ChunkIndex struct + compute_real_quotes + prefix_xor (scalar logic)
  • Scalar find_char_mask fallback
  • All Kani proofs

Hard part: SIMD extraction from chunk_classify.rs

chunk_classify.rs currently has both scalar logic (stays in core) and SIMD platform impls (moves to logfwd-arrow). The split point:

Stays in core:

  • ChunkIndex struct + new(), next_quote(), is_in_string(), scan_string(), skip_nested()
  • compute_real_quotes(), prefix_xor()
  • Scalar find_char_mask() (the #[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))] fallback)
  • The #[cfg(kani)] mod verification block
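
For reviewers unfamiliar with the scalar half, here is a minimal sketch of what the bitmask logic that stays in core might look like. The function names come from the list above, but the signatures and bodies are assumptions for illustration, not the current code. In particular, this version of compute_real_quotes ignores backslash runs that carry over from a previous 64-byte chunk.

```rust
/// Prefix XOR: output bit i is the XOR of input bits 0..=i.
/// Applied to a mask of unescaped quotes, it marks the bytes from each
/// opening quote up to (not including) the matching closing quote.
pub fn prefix_xor(mut bits: u64) -> u64 {
    bits ^= bits << 1;
    bits ^= bits << 2;
    bits ^= bits << 4;
    bits ^= bits << 8;
    bits ^= bits << 16;
    bits ^= bits << 32;
    bits
}

/// Assumed signature: keeps only the quotes preceded by an even number of
/// consecutive backslashes. Simplification: no carry between chunks.
pub fn compute_real_quotes(quotes: u64, backslashes: u64) -> u64 {
    let mut real = 0u64;
    let mut run = 0u32; // length of the backslash run ending just before bit i
    for i in 0..64 {
        let bit = 1u64 << i;
        if backslashes & bit != 0 {
            run += 1;
            continue;
        }
        if quotes & bit != 0 && run % 2 == 0 {
            real |= bit;
        }
        run = 0;
    }
    real
}
```

Both functions are pure integer code with no platform dependencies, which is why they can stay in core next to the Kani proofs.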

Moves to logfwd-arrow:

  • mod x86 (AVX2 + SSE2 impls)
  • mod aarch64_impl (NEON impl)
  • The platform dispatch function find_quotes_and_backslashes()
  • All SIMD-specific tests

The connection between them: core's ChunkIndex::new() currently calls find_quotes_and_backslashes() directly. After the split, core should define a CharDetector trait and ChunkIndex::new() should be generic over it (or take a function pointer). logfwd-arrow implements the trait with SIMD.
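
To make the design question concrete, here is a minimal sketch of the trait-based option. Only the names CharDetector and ChunkIndex come from this issue; the method signature, the ScalarDetector type, and the simplified ChunkIndex fields are hypothetical and exist only to show how the dependency would be inverted.

```rust
/// Hypothetical trait defined in logfwd-core.
pub trait CharDetector {
    /// Returns (quote mask, backslash mask) for one 64-byte chunk.
    fn find_quotes_and_backslashes(&self, chunk: &[u8; 64]) -> (u64, u64);
}

/// Scalar detector that could live in core as the portable default.
pub struct ScalarDetector;

impl CharDetector for ScalarDetector {
    fn find_quotes_and_backslashes(&self, chunk: &[u8; 64]) -> (u64, u64) {
        let (mut quotes, mut backslashes) = (0u64, 0u64);
        for (i, &b) in chunk.iter().enumerate() {
            match b {
                b'"' => quotes |= 1u64 << i,
                b'\\' => backslashes |= 1u64 << i,
                _ => {}
            }
        }
        (quotes, backslashes)
    }
}

/// Simplified stand-in for the real ChunkIndex: one mask pair per chunk.
pub struct ChunkIndex {
    pub quotes: Vec<u64>,
    pub backslashes: Vec<u64>,
}

impl ChunkIndex {
    /// Generic over the detector, so core never names a SIMD type.
    pub fn new<D: CharDetector>(buf: &[u8], detector: &D) -> Self {
        let mut quotes = Vec::new();
        let mut backslashes = Vec::new();
        for block in buf.chunks(64) {
            // Zero-pad the final partial chunk; 0 matches neither character.
            let mut padded = [0u8; 64];
            padded[..block.len()].copy_from_slice(block);
            let (q, b) = detector.find_quotes_and_backslashes(&padded);
            quotes.push(q);
            backslashes.push(b);
        }
        ChunkIndex { quotes, backslashes }
    }
}
```

Under this sketch, logfwd-arrow would provide its own type implementing CharDetector with the SIMD impls and pass it to ChunkIndex::new. The function-pointer option would replace the generic parameter with something like `fn(&[u8; 64]) -> (u64, u64)`, which avoids monomorphization at the cost of an indirect call per chunk.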

Steps

  1. Create crates/logfwd-arrow/Cargo.toml (deps: logfwd-core, arrow, bytes)
  2. Move streaming_builder.rs, storage_builder.rs to logfwd-arrow/src/
  3. Extract SimdScanner + StreamingSimdScanner from scanner.rs → logfwd-arrow/src/scanner.rs
  4. Extract SIMD from chunk_classify.rs → logfwd-arrow/src/simd.rs
  5. Define CharDetector trait in core, make ChunkIndex::new generic over it
  6. Implement CharDetector with SIMD in logfwd-arrow
  7. Update all imports across workspace (logfwd, logfwd-transform, logfwd-bench, logfwd-output)
  8. Confirm that all existing tests pass

Assignability

The file moves are Copilot-friendly. The CharDetector trait + SIMD split needs design review.

Consider splitting into two PRs:

  • PR A: Move builders + scanner structs (purely mechanical)
  • PR B: SIMD extraction + CharDetector trait (needs thought)

Parent: #262

Labels: enhancement