feat: json/json_int/json_float UDFs for raw-first extraction by strawgate · Pull Request #645 · strawgate/fastforward

strawgate · 2026-04-02T05:46:18Z

Summary

- Adds json(_raw, 'key'), json_int(_raw, 'key'), json_float(_raw, 'key') scalar UDFs backed by our SIMD scanner
- Registers all three UDFs in SqlTransform::ensure_context() so they're available in all user SQL
- Adds logfwd-arrow + bytes deps to logfwd-transform (needed by UDF impl)
- Includes 38 integration tests covering extraction, type coercion, WHERE clauses, NULL handling, Utf8View compatibility, and passthrough
- Includes a side-by-side benchmark (raw_first_bench.rs) comparing the current scanner-first path vs raw-first path

Motivation

This is the prototype for the raw-first architecture: readers store the original JSON line in a _raw column, and SQL extracts fields on demand via these UDFs. This eliminates the type-suffix column naming problem entirely for users who don't need pre-extracted typed columns.

Known limitations (tracked for follow-up)

- Each json(_raw, ...) call re-parses _raw independently, so N fields means N scans. A DataFusion OptimizerRule that batches all json*(_raw, ...) calls into a single scan is the planned fix.
- NULL _raw rows cause a scanner row-count mismatch error (scanner skips blank lines).
- json_int(_raw, 'key') on quoted-string numerics (e.g. "200") returns NULL rather than parsing the string.
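The NULL _raw limitation can be illustrated with a small, dependency-free sketch. It assumes, as the review notes on parse_raw describe, that the UDF rebuilds an NDJSON buffer from the _raw array with NULL entries becoming empty lines, and that the scanner skips blank lines. The helper names `build_ndjson`, `count_scanned_rows`, and `check_row_count` are hypothetical stand-ins, not functions from logfwd.

```rust
/// Hypothetical stand-in: join `_raw` values into one NDJSON buffer.
/// A NULL row (None) contributes an empty line.
fn build_ndjson(raw: &[Option<&str>]) -> String {
    let mut buf = String::new();
    for row in raw {
        buf.push_str(row.unwrap_or(""));
        buf.push('\n');
    }
    buf
}

/// Hypothetical stand-in for the scanner's row counting: blank lines are skipped.
fn count_scanned_rows(ndjson: &str) -> usize {
    ndjson.lines().filter(|line| !line.trim().is_empty()).count()
}

/// Mirrors the post-scan validation described in the PR: the scanned row
/// count must equal the number of input rows, otherwise an error is returned.
fn check_row_count(raw: &[Option<&str>]) -> Result<usize, String> {
    let got = count_scanned_rows(&build_ndjson(raw));
    if got != raw.len() {
        return Err(format!(
            "scanner row count mismatch: got {} rows, expected {}",
            got,
            raw.len()
        ));
    }
    Ok(got)
}
```

With two valid rows the check passes; replacing one of them with a NULL makes the scanned count fall one short of the input length, which is the error the tests pin down.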

Test plan

- cargo test -p logfwd-transform --test json_udf_tests: 38 tests pass
- cargo test -p logfwd-transform: existing lib tests still pass
- cargo clippy -p logfwd-transform -- -D warnings: clean

🤖 Generated with Claude Code

Add json(), json_int(), json_float() DataFusion UDFs that extract fields from raw JSON strings using our SIMD scanner. These enable the raw-first architecture where readers produce _raw lines and SQL handles extraction.

UDFs:
- json(_raw, 'key') → Utf8 (string value)
- json_int(_raw, 'key') → Int64 (NULL if not int)
- json_float(_raw, 'key') → Float64 (NULL if not float)

Side-by-side benchmark comparing:
- Path A (current): scanner → typed columns → SQL
- Path B (raw-first): _raw column → json UDFs in SQL

Results (50K rows × 50 fields):
- Extraction: A=184ms (197 MB/s) vs B=738ms (49 MB/s)
- Passthrough: A=165ms (219 MB/s) vs B=6ms (6,098 MB/s)

Next step: optimizer rule to merge multiple json() calls sharing the same _raw input into a single parse pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…raction

When user SQL contains multiple json()/json_int()/json_float() calls sharing the same _raw column, the preprocess pass:
1. Extracts all referenced field names from the SQL
2. Runs the SIMD scanner once for those fields
3. Adds extracted columns to the batch
4. Rewrites the SQL to use direct column references

Benchmark results (10K rows × 100 fields, extracting 10):
- Path A (scanner+SQL): 106ms (133 MB/s)
- Path B (raw UDFs, N parses): 926ms (15 MB/s)
- Path C (raw+preprocess): 44ms (319 MB/s) ← 2.4x faster than A

Path C beats Path A at wide schemas because it only parses the fields the SQL references, while Path A parses all 100 fields.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
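Step 1 of the preprocess pass (collecting the field names referenced in the SQL) might look roughly like the sketch below. This is not the code from json_preprocess.rs; `referenced_keys` is a hypothetical helper that uses naive substring matching and assumes calls are written exactly as `json*(_raw, 'key')` with a single space after the comma. That brittleness is the kind of problem that makes SQL string rewriting fragile.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: collect the distinct keys referenced by
/// json(_raw, 'key'), json_int(_raw, 'key') and json_float(_raw, 'key')
/// calls in a SQL string. Returns the keys sorted and deduplicated.
/// Naive substring matching, not a SQL parser.
fn referenced_keys(sql: &str) -> Vec<String> {
    const PREFIXES: [&str; 3] = ["json(_raw, '", "json_int(_raw, '", "json_float(_raw, '"];
    let mut keys: BTreeSet<String> = BTreeSet::new();
    for &prefix in PREFIXES.iter() {
        let mut rest = sql;
        // Walk through every occurrence of this call prefix.
        while let Some(start) = rest.find(prefix) {
            let after = &rest[start + prefix.len()..];
            match after.find('\'') {
                Some(end) => {
                    keys.insert(after[..end].to_string());
                    rest = &after[end..];
                }
                // Unterminated string literal: stop scanning for this prefix.
                None => break,
            }
        }
    }
    keys.into_iter().collect()
}
```

The resulting key list is what a single scanner pass would be asked to extract, so ten referenced fields out of a hundred cost one scan over ten fields rather than ten scans.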

The shared parse cache approach doesn't work because DataFusion may call UDFs on filtered subsets of rows: the cache holds the full batch but DataFusion expects results matching the filtered row count. UDFs now parse independently per call (correct but slow for many fields).

The json_preprocess module remains as the optimization path: it runs the scanner once before DataFusion, adds extracted columns to the batch, and rewrites the SQL to use column references.

Benchmark shows preprocess path (C) matches or beats the current scanner pipeline (A) for wide schemas while keeping the raw-first architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Major refactor of the json UDF prototype:
- Deduplicate three separate structs into single `JsonExtractUdf` parameterized by `JsonExtractMode` enum (Str, Int, Float)
- Fix Utf8View crash: accept Utf8, Utf8View, LargeUtf8 in signatures and coerce to StringArray before processing
- Fix row-count mismatch: validate scanner output rows match input
- Propagate scanner errors as DataFusionError::Execution instead of silently returning nulls
- Remove HashMap allocation in hot path; return RecordBatch directly
- Delete json_preprocess.rs (SQL string rewriting too fragile)

Add 38 integration tests covering:
- Basic extraction (string, int, float)
- Type coercion (string-to-int parsing, int-to-float)
- Null handling (NULL rows, missing fields, mixed)
- Edge cases (empty {}, non-JSON, malformed, 12KB strings, unicode, nested objects, arrays, booleans, JSON null, duplicate keys, large integers, negative numbers, scientific notation)
- WHERE clause integration
- SELECT * passthrough
- Utf8View input compatibility

Known issues documented in tests:
- NULL _raw rows cause scanner row-count mismatch
- json_int cannot parse quoted-string numeric values ("200")

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-04-02T05:46:29Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4ca63053-fea1-498b-ae7b-bea38748d15f

📥 Commits

Reviewing files that changed from the base of the PR and between b946ef6 and 3515feb.

📒 Files selected for processing (1)

crates/logfwd-transform/src/udf/json_extract.rs

Walkthrough

This change adds three DataFusion scalar UDFs—json, json_int, and json_float—implemented by a new JsonExtractUdf (with JsonExtractMode) that extracts named fields from NDJSON in a _raw column using StreamingSimdScanner. The UDFs are registered in SqlTransform::ensure_context. logfwd-arrow (and bytes) were promoted from dev-dependencies into regular dependencies in crates/logfwd-transform/Cargo.toml. New integration tests and a side-by-side benchmark exercising extraction, coercion, null handling, and performance scenarios were added.

Possibly related PRs

feat: json/json_int/json_float UDFs for raw-first extraction #645: Adds the same json/json_int/json_float UDFs, json_extract module, tests, and benchmark — appears to be the same or directly overlapping changes.
feat: create logfwd-arrow crate, move builders + scanner structs (Step 1) #307: Introduces/relocates scanner and builder types (e.g., StreamingSimdScanner) in logfwd-arrow that the new UDFs depend on.
feat: streaming structural scanner — zero-alloc bitmask iteration #525: Modifies streaming JSON scanner APIs (e.g., scan_streaming/structural scanning) used by the UDFs and benchmark, affecting the scanning pipeline.

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-transform/Cargo.toml`:
- Around line 22-27: Remove the redundant dev-dependency entry for bytes in
Cargo.toml: delete the bytes = "1" line under [dev-dependencies] because bytes
is already declared under [dependencies]; update the file so only the single
bytes = "1" in the main dependencies block remains.

In `@crates/logfwd-transform/src/udf/json_extract.rs`:
- Around line 232-247: Replace the bare .unwrap() on the downcasted array with
a failure-aware message: after casting with arrow::compute::cast(&arr,
&DataType::Utf8) change the .downcast_ref::<StringArray>().unwrap() call to
.downcast_ref::<StringArray>().expect("expected Utf8 cast to produce StringArray
in json_extract parsing") so failures include a clear diagnostic, or return a
proper Err if panics must be avoided in production, since .expect() still
panics; update the same occurrence in the json extraction code path where
str_arr is defined.
- Around line 59-67: The suffix_order method contains dead empty-string entries
that will never match any scanner output; update suffix_order (the match arms
for Self::Str, Self::Int, Self::Float) to remove the "" entries so each arm only
returns the actual suffixes the scanner emits (e.g., Self::Str -> &["_str",
"_int", "_float"], Self::Int -> &["_int"], Self::Float -> &["_float", "_int"]).
Keep the function signature and return type unchanged but adjust the returned
static slices to omit the empty-string placeholders.
- Around line 74-109: Add a brief doc comment above the parse_raw function
explaining that the post-scan row-count validation (the comparison between
batch.num_rows() and raw_array.len()) intentionally fails when the `_raw` column
contains NULLs; mention that parse_raw reconstructs NDJSON from raw_array, NULL
entries become empty lines and thus cause a mismatch and that this is a known
limitation covered by tests. Reference parse_raw, raw_array, and the
batch.num_rows() check so future maintainers understand this is intentional
rather than a bug.
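The suffix_order cleanup and the column lookup it feeds can be sketched as follows. The suffix lists are the ones quoted in the review comment; everything else is an assumption for illustration. In particular, `find_column` and the `HashMap<String, String>` are hypothetical stand-ins for the real `batch.column_by_name` lookup on a RecordBatch.

```rust
use std::collections::HashMap;

/// Extraction mode, one per UDF: json -> Str, json_int -> Int, json_float -> Float.
#[derive(Clone, Copy, Debug, PartialEq)]
enum JsonExtractMode {
    Str,
    Int,
    Float,
}

impl JsonExtractMode {
    /// Column-name suffixes to try, in priority order.
    /// No empty-string entries: the scanner never emits bare column names.
    fn suffix_order(self) -> &'static [&'static str] {
        match self {
            Self::Str => &["_str", "_int", "_float"],
            Self::Int => &["_int"],
            Self::Float => &["_float", "_int"],
        }
    }
}

/// Hypothetical stand-in for `batch.column_by_name`: columns keyed by name,
/// each holding one rendered value to keep the sketch small.
/// Returns the first column whose name is `key` plus a suffix from `suffix_order`.
fn find_column<'a>(
    columns: &'a HashMap<String, String>,
    key: &str,
    mode: JsonExtractMode,
) -> Option<&'a String> {
    mode.suffix_order()
        .iter()
        .find_map(|suffix| columns.get(&format!("{key}{suffix}")))
}
```

Because every suffix is non-empty, the lookup can always build the name with `format!("{key}{suffix}")`; no special case for a bare key is needed.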

ℹ️ Review info

Run ID: 20fed77d-b6a7-48e8-97ea-f63941c28bdd

📥 Commits

Reviewing files that changed from the base of the PR and between 915ad2f and 09bbcbc.

📒 Files selected for processing (6)

crates/logfwd-transform/Cargo.toml
crates/logfwd-transform/src/lib.rs
crates/logfwd-transform/src/udf/json_extract.rs
crates/logfwd-transform/src/udf/mod.rs
crates/logfwd-transform/tests/json_udf_tests.rs
crates/logfwd-transform/tests/raw_first_bench.rs

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-transform/tests/json_udf_tests.rs`:
- Around line 275-279: The test currently asserts only that result.is_err()
which is too broad; update the assertion in
crates/logfwd-transform/tests/json_udf_tests.rs (both occurrences around the
result variable at the shown blocks and the similar block at 792-795) to assert
the error is the specific DataFusion execution error containing the scanner
row-count mismatch message (the DataFusionError::Execution text "scanner row
count mismatch: got {} rows, expected {}"). Concretely, unwrap the Err from
result, convert it to its display/string form and assert that it contains (or
equals) the "scanner row count mismatch" substring so the test only passes for
that known limitation and not any unrelated failure.
- Around line 79-84: The helper function `query` currently uses `.find()` on
`batches` which returns only the first non-empty DataFusion batch and drops
subsequent non-empty batches; update `query` (the async function that builds
`ctx` via `make_ctx`, calls `ctx.sql(sql).await.unwrap()`, and collects batches
with `df.collect().await.unwrap()`) to filter for all non-empty batches, and
then concatenate them into a single RecordBatch (using the Arrow/DataFusion
concat utility that accepts a schema and slice of RecordBatches) returning
Some(concatenated_batch) or None if there are no non-empty batches.
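The difference between the old and the suggested behaviour of the `query` helper can be shown with a simplified model. Here a batch is just a `Vec<i64>`, a hypothetical stand-in for a RecordBatch; in the real helper the concatenation would presumably go through Arrow's batch-concatenation utility rather than `flatten`.

```rust
/// Hypothetical stand-in for a RecordBatch: one column of i64 values.
type Batch = Vec<i64>;

/// The old behaviour: `.find()` returns only the first non-empty batch,
/// so any later non-empty batches are silently dropped.
fn first_non_empty(batches: &[Batch]) -> Option<Batch> {
    batches.iter().find(|b| !b.is_empty()).cloned()
}

/// The suggested behaviour: keep every non-empty batch and concatenate them.
/// Returns None when there are no non-empty batches at all.
fn concat_non_empty(batches: &[Batch]) -> Option<Batch> {
    let non_empty: Vec<&Batch> = batches.iter().filter(|b| !b.is_empty()).collect();
    if non_empty.is_empty() {
        return None;
    }
    Some(non_empty.into_iter().flatten().copied().collect())
}
```

A test that only inspects the first batch can pass while rows in later batches go unchecked, which is why the helper matters even though it lives in test code.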

ℹ️ Review info

Run ID: f5498ebf-0942-4611-953d-3eca7c31f2c1

📥 Commits

Reviewing files that changed from the base of the PR and between 09bbcbc and 0f124d5.

📒 Files selected for processing (1)

crates/logfwd-transform/tests/json_udf_tests.rs

- Remove redundant bytes dev-dependency (already in [dependencies])
- Remove dead empty-string entries from suffix_order (scanner never emits bare column names)
- Replace .unwrap() on downcast with .expect() for clearer diagnostics in Int/Float paths
- Add doc comment to parse_raw explaining the NULL _raw row-count mismatch limitation
- Fix query() test helper to concat all non-empty batches instead of dropping later batches
- Tighten null_raw_row_causes_scanner_mismatch and mixed_valid_invalid_null_rows_errors to assert the specific "scanner row count mismatch" error text, not just is_err()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-transform/src/udf/json_extract.rs`:
- Line 245: Fix the typo in the inline comment inside
crates/logfwd-transform/src/udf/json_extract.rs where the comment currently
reads "unparseable → null."; change "unparseable" to "unparsable" so the comment
reads "unparsable → null." (look for the comment in the json_extract.rs code
block that explains parsing strings as i64).
- Around line 221-228: The branch checking suffix.is_empty() in the closure used
to compute col is dead because mode.suffix_order() never yields empty strings;
simplify the closure in json_extract.rs by removing that branch and always
building col_name with format!("{key}{suffix}") (or string concatenation) before
calling batch.column_by_name(&col_name).map(Arc::clone), referencing the
existing closure that assigns to let col and using mode.suffix_order(), key,
suffix, and batch.column_by_name to locate the column.

In `@crates/logfwd-transform/tests/json_udf_tests.rs`:
- Line 173: Replace the non-standard spelling "unparseable" with "unparsable" in
the comment text and the test identifier: update the inline comment that reads
"// _int column first.  The scanner writes 0 for unparseable-as-int strings" to
use "unparsable", and rename the test function whose name contains "unparseable"
(the test at the other flagged location) to use "unparsable" in its identifier
so both the comment and the test name consistently use the standard spelling.

ℹ️ Review info

Run ID: abade759-35a5-462c-916b-5fb3afab984b

📥 Commits

Reviewing files that changed from the base of the PR and between 0f124d5 and ada0ca8.

📒 Files selected for processing (3)

crates/logfwd-transform/Cargo.toml
crates/logfwd-transform/src/udf/json_extract.rs
crates/logfwd-transform/tests/json_udf_tests.rs

strawgate and others added 4 commits April 2, 2026 00:45

coderabbitai Bot requested changes Apr 2, 2026

style: cargo fmt on json_udf_tests

0f124d5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot requested changes Apr 2, 2026

coderabbitai Bot requested changes Apr 2, 2026

strawgate and others added 2 commits April 2, 2026 01:49

fix: correct spelling unparseable -> unparsable (typos check)

b946ef6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: remove dead suffix.is_empty() branch in find_map

3515feb

suffix_order() no longer returns any "" entries so the is_empty() guard is unreachable. Simplify to format!("{key}{suffix}") directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

strawgate merged commit efd8678 into master Apr 2, 2026
4 of 7 checks passed

strawgate deleted the worktree-raw-first branch April 2, 2026 06:54

strawgate merged 8 commits into master from worktree-raw-first
