fix: Boolean and LargeUtf8 column serialization in JSON and OTLP by strawgate · Pull Request #636 · strawgate/fastforward

strawgate · 2026-04-02T02:31:07Z

Boolean Arrow columns were serializing as empty strings in both JSON (stdout/HTTP) and OTLP sinks. This was because they fell back to a string conversion that only supported Utf8/Utf8View.

This PR:

Fixes str_value to support LargeUtf8.
Fixes write_json_value to explicitly handle Boolean columns, emitting unquoted true/false.
Updates OtlpSink to recognize Boolean columns and encode them using the OTLP bool_value field (confirmed as field 2 via AnyValue protobuf spec and verified with tests).
Adds comprehensive regression tests in crates/logfwd-output/src/lib.rs and crates/logfwd-output/src/otlp_sink.rs.
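
The JSON side of the fix can be pictured with a small dependency-free sketch. The `Column` enum, the `write_json_string` helper, and the exact signature of `write_json_value` below are assumptions made for illustration; the real code operates on Arrow arrays and may be structured differently.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for an Arrow column (the real sink uses `arrow` arrays).
enum Column {
    Boolean(Vec<Option<bool>>),
    Text(Vec<Option<String>>),
}

/// Writes one cell as a JSON value: booleans unquoted, strings quoted and
/// escaped, missing values as `null`. Panics if `row` is out of range.
fn write_json_value<W: Write>(col: &Column, row: usize, out: &mut W) -> io::Result<()> {
    match col {
        Column::Boolean(values) => match values[row] {
            Some(true) => out.write_all(b"true"),
            Some(false) => out.write_all(b"false"),
            None => out.write_all(b"null"),
        },
        Column::Text(values) => match &values[row] {
            Some(s) => write_json_string(s, out),
            None => out.write_all(b"null"),
        },
    }
}

/// Minimal JSON string escaping (assumed helper, not taken from the PR).
fn write_json_string<W: Write>(s: &str, out: &mut W) -> io::Result<()> {
    out.write_all(b"\"")?;
    for ch in s.chars() {
        match ch {
            '"' => out.write_all(b"\\\"")?,
            '\\' => out.write_all(b"\\\\")?,
            '\n' => out.write_all(b"\\n")?,
            '\r' => out.write_all(b"\\r")?,
            '\t' => out.write_all(b"\\t")?,
            c if (c as u32) < 0x20 => write!(out, "\\u{:04x}", c as u32)?,
            c => write!(out, "{}", c)?,
        }
    }
    out.write_all(b"\"")
}
```

The point of the explicit `Boolean` arm is that the value never passes through a string conversion, so it cannot silently degrade to an empty string.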

A few unrelated clippy fixes in crates/logfwd-io/tests/transport_e2e.rs were included to reduce noise during workspace-wide testing.

Fixes #605

PR created automatically by Jules for task 12889585171185946840 started by @strawgate

- Implemented explicit JSON serialization for Boolean columns (previously serialized as empty strings).
- Added support for LargeUtf8 columns in output sinks.
- Updated OTLP sink to correctly encode Boolean attributes as OTLP bool_value (field 2 in AnyValue).
- Added regression tests for both JSON and OTLP output paths.
- Fixed unrelated clippy warnings in transport tests to facilitate workspace builds.

Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>

google-labs-jules · 2026-04-02T02:31:09Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

coderabbitai · 2026-04-02T02:31:22Z

Walkthrough

This PR fixes a correctness issue where Boolean and LargeUtf8 Arrow columns were serializing as empty strings in JSON and OTLP output. Changes include: extending str_value() to support DataType::LargeUtf8 using i64 indexing, adding explicit Boolean handling in JSON serialization to emit true/false instead of falling back to string conversion, introducing a new AttrArray::Bool variant in the OTLP sink with a helper function to encode boolean attributes using protobuf bool_value, and updating tests to reflect the new line-splitting approach and adding coverage for boolean/LargeUtf8 serialization.
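
The difference between Utf8 and LargeUtf8 is the width of the offsets into the shared byte buffer (i32 versus i64). A minimal sketch of that lookup, assuming a hypothetical `StringColumn` type that is generic over the offset width and is not the actual `str_value()` implementation:

```rust
use std::convert::TryInto;

/// Hypothetical model of a variable-length string column.
/// `O` is the offset type: `i32` for Utf8, `i64` for LargeUtf8.
struct StringColumn<O> {
    /// One more entry than there are rows; row `i` spans offsets[i]..offsets[i + 1].
    offsets: Vec<O>,
    /// Concatenated UTF-8 bytes of all rows.
    data: Vec<u8>,
}

impl<O> StringColumn<O>
where
    O: Copy + TryInto<usize>,
{
    /// Returns the string at `row`, or `None` if the row is out of range,
    /// an offset is negative or inverted, or the bytes are not valid UTF-8.
    fn value(&self, row: usize) -> Option<&str> {
        let start: usize = self.offsets.get(row).copied()?.try_into().ok()?;
        let end: usize = self.offsets.get(row + 1).copied()?.try_into().ok()?;
        std::str::from_utf8(self.data.get(start..end)?).ok()
    }
}
```

With this shape, `StringColumn::<i32>` and `StringColumn::<i64>` share one code path, which is roughly the effect of adding a LargeUtf8 arm next to the existing Utf8 one.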

Possibly related PRs

fix: output sinks dispatch on Arrow DataType, not column name suffix #568: Both modify JSON/console serialization to dispatch on Arrow DataType for selecting output encoding paths.
audit: fix scan_error metrics, write_row_json bugs, strengthen OTLP proofs #416: Both update JSON serialization logic in crates/logfwd-output/src/lib.rs for handling additional Arrow type variants.
perf: allocation regression tests + cache DataFusion SessionContext (#465) #609: Related through overlapping changes to allocation_regression tests and OTLP encoding additions in otlp_sink.rs.

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	PR addresses all primary objectives from issue `#605`: Boolean columns now emit unquoted true/false in JSON [lib.rs], OTLP encodes booleans via bool_value [otlp_sink.rs], LargeUtf8 support added [lib.rs], and regression tests included for both sinks.
Out of Scope Changes check	✅ Passed	Changes align with PR scope: Boolean/LargeUtf8 serialization in lib.rs and otlp_sink.rs are on-scope; minor clippy fixes in udp_sink.rs and allocation_regression.rs are acceptable housekeeping.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-output/src/lib.rs`:
- Around line 1252-1281: Add assertions for the third row (index 2) in the
boolean_and_large_utf8_serialization test to verify null handling: after
creating the batch with make_batch and using render(&batch, 2), assert that the
parsed JSON (serde_json::Value) has null for both "active" and "note". Update
the test function boolean_and_large_utf8_serialization to call render for row 2
and assert v2["active"].is_null() and v2["note"].is_null() so the test becomes
self-contained while still relying on render and make_batch.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f2fba268-7c0f-4eff-9e34-934e3f8b0e73

📥 Commits

Reviewing files that changed from the base of the PR and between cc18ff1 and a51fbae.

📒 Files selected for processing (3)

crates/logfwd-io/tests/transport_e2e.rs
crates/logfwd-output/src/lib.rs
crates/logfwd-output/src/otlp_sink.rs

- Implemented explicit JSON serialization for Boolean columns (previously serialized as empty strings).
- Added support for LargeUtf8 columns in output sinks.
- Updated OTLP sink to correctly encode Boolean attributes as OTLP bool_value (field 2 in AnyValue).
- Added regression tests for both JSON and OTLP output paths.
- Addressed various workspace clippy warnings.

Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

crates/logfwd/tests/integration.rs (1)
256-264: ⚠️ Potential issue | 🔴 Critical

Fix mismatched raw string delimiters.

Line 256 opens with r" but line 264 closes with "#. This is a syntax error — either change the opening to r#" or change the closing to " alone.

If using r#"..."# form (recommended for containing inner quotes in YAML), the opening must be r#".
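
As a reminder of how the two raw-string forms differ, here is a small sketch. The YAML content and function names are made up for illustration and are not taken from integration.rs.

```rust
/// `r"..."` ends at the first `"`, so it cannot contain double quotes.
/// Backslashes are kept literally.
fn plain_raw() -> &'static str {
    r"path: C:\logs\app.log"
}

/// `r#"..."#` ends only at `"#`, so inner double quotes are allowed.
/// The opening `r#"` and closing `"#` must use the same number of `#`.
fn hashed_raw() -> &'static str {
    r#"input:
  type: file
  path: "/var/log/app.log"
"#
}
```

Mixing the forms (opening with `r"` and closing with `"#`) leaves a stray `#` after the literal, which is why the compiler rejects it.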
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd/tests/integration.rs` around lines 256 - 264, The raw string
literal in the integration test is using mismatched delimiters (it starts with
r" but ends with "#) which causes a syntax error; update the opening delimiter
to r#" so the YAML block uses the r#"..."# raw-string form (the multiline string
shown in the diff) ensuring the opening and closing delimiters match.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/logfwd-output/src/otlp_sink.rs`:
- Around line 726-753: Add a second assertion in the encode_boolean_as_attribute
test to validate encoding of false: create a RecordBatch with a BooleanArray
containing Some(false), call sink.encode_batch(&batch, &make_metadata()) (same
helper functions make_sink/make_metadata), and assert that sink.encoder_buf
contains the byte sequence [0x10u8, 0x00] for the AnyValue bool_value false and
that the attribute key "active" is present (use contains_bytes and encoder_buf
as in the existing true case); this ensures the false varint (0x00) is emitted
correctly.

In `@crates/logfwd-output/tests/allocation_regression.rs`:
- Line 69: The test currently swallows the Result from write_row_json by using
"let _ = write_row_json(&batch, row, &cols, &mut buf);", which can hide
serialization failures and break allocation assertions; fix by either (A)
asserting success immediately (e.g., replace with write_row_json(&batch, row,
&cols, &mut buf).unwrap() or .expect("write_row_json failed")) or (B) change the
test function to return Result and propagate the error with
write_row_json(&batch, row, &cols, &mut buf)?; update all occurrences of the
write_row_json call so failures cause the test to fail instead of being ignored.

---

Outside diff comments:
In `@crates/logfwd/tests/integration.rs`:
- Around line 256-264: The raw string literal in the integration test is using
mismatched delimiters (it starts with r" but ends with "#) which causes a syntax
error; update the opening delimiter to r#" so the YAML block uses the r#"..."#
raw-string form (the multiline string shown in the diff) ensuring the opening
and closing delimiters match.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0f344cc8-6a03-4d12-872f-62aba9672657

📥 Commits

Reviewing files that changed from the base of the PR and between a51fbae and d87b201.

📒 Files selected for processing (22)

crates/logfwd-arrow/src/streaming_builder.rs
crates/logfwd-arrow/tests/allocation_regression.rs
crates/logfwd-bench/benches/pipeline.rs
crates/logfwd-config/src/lib.rs
crates/logfwd-core/benches/scanner.rs
crates/logfwd-core/examples/arrow_ipc_roundtrip.rs
crates/logfwd-core/src/aggregator.rs
crates/logfwd-core/src/pipeline/lifecycle.rs
crates/logfwd-core/tests/compliance_data.rs
crates/logfwd-core/tests/scanner_conformance.rs
crates/logfwd-io/src/generator.rs
crates/logfwd-io/src/metric_history.rs
crates/logfwd-io/src/otlp_receiver.rs
crates/logfwd-output/src/lib.rs
crates/logfwd-output/src/otlp_sink.rs
crates/logfwd-output/src/udp_sink.rs
crates/logfwd-output/tests/allocation_regression.rs
crates/logfwd-transform/src/lib.rs
crates/logfwd/tests/allocation_e2e.rs
crates/logfwd/tests/allocation_scaling.rs
crates/logfwd/tests/compliance_file.rs
crates/logfwd/tests/integration.rs

coderabbitai · 2026-04-02T03:13:06Z

+    #[test]
+    fn encode_boolean_as_attribute() {
+        use arrow::array::BooleanArray;
+        let schema = Arc::new(Schema::new(vec![Field::new(
+            "active",
+            DataType::Boolean,
+            true,
+        )]));
+        let arr = BooleanArray::from(vec![Some(true)]);
+        let batch = RecordBatch::try_new(schema, vec![Arc::new(arr)]).unwrap();
+
+        let mut sink = make_sink();
+        sink.encode_batch(&batch, &make_metadata());
+
+        // LogRecord field 6 tag: (6 << 3) | 2 = 0x32
+        // KeyValue field 1 key tag: (1 << 3) | 2 = 0x0A, then "active"
+        // KeyValue field 2 value AnyValue tag: (2 << 3) | 2 = 0x12
+        // AnyValue field 2 bool_value tag: (2 << 3) | 0 = 0x10, then 0x01
+        let expected = [0x10u8, 0x01];
+        assert!(
+            contains_bytes(&sink.encoder_buf, &expected),
+            "boolean attribute not found in encoded output"
+        );
+        assert!(
+            contains_bytes(&sink.encoder_buf, b"active"),
+            "attribute key 'active' not found"
+        );
+    }


🧹 Nitpick | 🔵 Trivial

Consider testing false value as well.

The test validates true → [0x10, 0x01]. Adding a false case would ensure the 0x00 varint is also emitted correctly. Optional since the encoding logic is straightforward.
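
The expected bytes follow from the protobuf tag rule `(field_number << 3) | wire_type`. A minimal sketch, with helper names that are assumptions and not part of `OtlpSink`:

```rust
/// Protobuf wire types used in this sketch.
const WIRE_VARINT: u8 = 0;
const WIRE_LEN: u8 = 2;

/// Single-byte tag for field numbers below 16: (field << 3) | wire_type.
fn tag(field: u8, wire_type: u8) -> u8 {
    assert!(field < 16, "larger field numbers need a multi-byte varint tag");
    (field << 3) | wire_type
}

/// Encodes the bool_value member of an AnyValue (field 2, varint):
/// the tag byte followed by 0x01 for true or 0x00 for false.
fn encode_bool_any_value(value: bool) -> [u8; 2] {
    [tag(2, WIRE_VARINT), value as u8]
}
```

Here `encode_bool_any_value(true)` yields `[0x10, 0x01]` and `encode_bool_any_value(false)` yields `[0x10, 0x00]`, while `tag(6, WIRE_LEN)` reproduces the `0x32` attribute tag mentioned in the test comments.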

Optional: extend test for false value

 #[test]
 fn encode_boolean_as_attribute() {
     use arrow::array::BooleanArray;
     let schema = Arc::new(Schema::new(vec![Field::new(
         "active",
         DataType::Boolean,
         true,
     )]));
-    let arr = BooleanArray::from(vec![Some(true)]);
+    let arr = BooleanArray::from(vec![Some(true), Some(false)]);
     let batch = RecordBatch::try_new(schema, vec![Arc::new(arr)]).unwrap();

     let mut sink = make_sink();
     sink.encode_batch(&batch, &make_metadata());

-    // LogRecord field 6 tag: (6 << 3) | 2 = 0x32
-    // KeyValue field 1 key tag: (1 << 3) | 2 = 0x0A, then "active"
-    // KeyValue field 2 value AnyValue tag: (2 << 3) | 2 = 0x12
-    // AnyValue field 2 bool_value tag: (2 << 3) | 0 = 0x10, then 0x01
-    let expected = [0x10u8, 0x01];
+    // AnyValue field 2 bool_value: tag 0x10, then 0x01 for true
+    let expected_true = [0x10u8, 0x01];
+    let expected_false = [0x10u8, 0x00];
     assert!(
-        contains_bytes(&sink.encoder_buf, &expected),
-        "boolean attribute not found in encoded output"
+        contains_bytes(&sink.encoder_buf, &expected_true),
+        "true boolean not found"
+    );
+    assert!(
+        contains_bytes(&sink.encoder_buf, &expected_false),
+        "false boolean not found"
     );

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-output/src/otlp_sink.rs` around lines 726 - 753, Add a second assertion in the encode_boolean_as_attribute test to validate encoding of false: create a RecordBatch with a BooleanArray containing Some(false), call sink.encode_batch(&batch, &make_metadata()) (same helper functions make_sink/make_metadata), and assert that sink.encoder_buf contains the byte sequence [0x10u8, 0x00] for the AnyValue bool_value false and that the attribute key "active" is present (use contains_bytes and encoder_buf as in the existing true case); this ensures the false varint (0x00) is emitted correctly.

coderabbitai · 2026-04-02T03:13:06Z


    for row in 0..batch.num_rows() {
-        write_row_json(&batch, row, &cols, &mut buf);
+        let _ = write_row_json(&batch, row, &cols, &mut buf);


⚠️ Potential issue | 🟠 Major

Do not swallow write_row_json failures in this regression test.
Line 69, Line 75, and Line 82 currently ignore Result. That can let serialization failures pass silently and invalidate allocation assertions.

Proposed fix

-        let _ = write_row_json(&batch, row, &cols, &mut buf);
+        write_row_json(&batch, row, &cols, &mut buf)
+            .expect("write_row_json should succeed in warmup");
@@
-        let _ = write_row_json(&batch, row, &cols, &mut buf);
+        write_row_json(&batch, row, &cols, &mut buf)
+            .expect("write_row_json should succeed in measured window 1");
@@
-        let _ = write_row_json(&batch, row, &cols, &mut buf);
+        write_row_json(&batch, row, &cols, &mut buf)
+            .expect("write_row_json should succeed in measured window 2");

Also applies to: 75-75, 82-82
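
The two ways of surfacing the error can be sketched without the real sink. `write_row` below is a hypothetical stand-in for `write_row_json`, with an assumed `Result<(), String>` return type; the actual error type in the crate may differ.

```rust
/// Hypothetical stand-in for `write_row_json`; fails on an out-of-range row.
fn write_row(rows: &[&str], row: usize, buf: &mut Vec<u8>) -> Result<(), String> {
    let line = rows
        .get(row)
        .ok_or_else(|| format!("row {} out of range", row))?;
    buf.extend_from_slice(line.as_bytes());
    buf.push(b'\n');
    Ok(())
}

/// Option A: fail loudly at the call site with `expect`.
fn render_all_or_panic(rows: &[&str]) -> Vec<u8> {
    let mut buf = Vec::new();
    for row in 0..rows.len() {
        write_row(rows, row, &mut buf).expect("write_row should succeed");
    }
    buf
}

/// Option B: propagate with `?` from a function that returns `Result`.
fn render_prefix(rows: &[&str], count: usize) -> Result<Vec<u8>, String> {
    let mut buf = Vec::new();
    for row in 0..count {
        write_row(rows, row, &mut buf)?;
    }
    Ok(buf)
}
```

With `let _ = write_row(...)` in place of either form, an out-of-range row would leave `buf` short and the loop would still finish without any signal, which is the failure mode the comment above warns about.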

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-output/tests/allocation_regression.rs` at line 69, The test currently swallows the Result from write_row_json by using "let _ = write_row_json(&batch, row, &cols, &mut buf);", which can hide serialization failures and break allocation assertions; fix by either (A) asserting success immediately (e.g., replace with write_row_json(&batch, row, &cols, &mut buf).unwrap() or .expect("write_row_json failed")) or (B) change the test function to return Result and propagate the error with write_row_json(&batch, row, &cols, &mut buf)?; update all occurrences of the write_row_json call so failures cause the test to fail instead of being ignored.

- Add explicit DataType::Boolean handling in JSON serialization to emit unquoted true/false/null.
- Add support for DataType::LargeUtf8 in str_value() utility.
- Add Boolean attribute support in OTLP sink using AnyValue.bool_value (field 2).
- Fix various workspace Clippy warnings and formatting issues to ensure CI compliance.
- Add regression tests for Boolean and LargeUtf8 serialization in both sinks.

Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>

coderabbitai

♻️ Duplicate comments (1)

crates/logfwd-output/src/lib.rs (1)

1256-1285: ⚠️ Potential issue | 🟡 Minor

Add assertions for the null row in this regression test.

Line 1262 and Line 1269 create a third row with None, but the test never validates row 2. That leaves boolean/null regression coverage incomplete.

Suggested patch

     // Row 1: false, "text"
     let json1 = render(&batch, 1);
     let v1: serde_json::Value = serde_json::from_str(&json1).unwrap();
     assert_eq!(v1["active"], false);
     assert_eq!(v1["note"], "text");
+
+    // Row 2: null, null
+    let json2 = render(&batch, 2);
+    let v2: serde_json::Value = serde_json::from_str(&json2).unwrap();
+    assert!(v2["active"].is_null());
+    assert!(v2["note"].is_null());
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@crates/logfwd-output/src/lib.rs` around lines 1256 - 1285, The test
boolean_and_large_utf8_serialization builds a third row with None values but
never validates it; add assertions that render(&batch, 2) produces JSON where
both "active" and "note" are null (e.g., parse the produced string with
serde_json::from_str and assert v2["active"].is_null() and
v2["note"].is_null()). Use the existing render and make_batch calls to locate
the code.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@crates/logfwd-output/src/lib.rs`:
- Around line 1256-1285: The test boolean_and_large_utf8_serialization builds a
third row with None values but never validates it; add assertions that
render(&batch, 2) produces JSON where both "active" and "note" are null (e.g.,
parse the produced string with serde_json::from_str and assert
v2["active"].is_null() and v2["note"].is_null()). Use the existing render and
make_batch calls to locate the code.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 22838c11-029f-436e-80d8-f92186d6a0da

📥 Commits

Reviewing files that changed from the base of the PR and between d87b201 and 82f23e9.

📒 Files selected for processing (1)

crates/logfwd-output/src/lib.rs

google-labs-jules Bot mentioned this pull request Apr 2, 2026

[correctness] Boolean Arrow columns serialize as empty strings in JSON and OTLP sinks #605

Closed

4 tasks

coderabbitai Bot requested changes Apr 2, 2026


Comment thread crates/logfwd-output/src/lib.rs

coderabbitai Bot requested changes Apr 2, 2026


coderabbitai Bot reviewed Apr 2, 2026


strawgate merged commit c439257 into master Apr 2, 2026
7 checks passed

strawgate deleted the fix-boolean-serialization-12889585171185946840 branch April 2, 2026 04:08


strawgate merged 3 commits into master from fix-boolean-serialization-12889585171185946840
