Skip to content

fix(scanner): decode JSON escape sequences in string values (#410)#885

Merged
strawgate merged 5 commits into
masterfrom
copilot/fix-json-escape-sequence-bug
Apr 4, 2026
Merged

fix(scanner): decode JSON escape sequences in string values (#410)#885
strawgate merged 5 commits into
masterfrom
copilot/fix-json-escape-sequence-bug

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 4, 2026

Scanner stored raw JSON escape bytes in string values. The output layer then re-escaped backslashes, turning \u00e9\\u00e9 — syntactically valid but semantically wrong JSON. Affects all escape sequences: \", \\, \/, \b, \f, \n, \r, \t, \uXXXX.

Changes

Scanner escape decoding (json_scanner.rs)

  • Added decode_json_escapes covering all RFC 8259 §7 escapes including \uXXXX surrogate pairs
  • Fast path: memchr(b'\\', raw) check — strings without backslashes pass through zero-copy unchanged
  • Slow path: decode into a scratch buffer allocated once per scan_streaming call, reused across lines via clear()

ScanBuilder trait (scanner.rs)

  • Added append_decoded_str_by_idx with default delegation to append_str_by_idx — signals that the value is decoded and may not be a subslice of the input buffer

StreamingBuilder override (streaming_builder.rs)

  • StreamingBuilder uses zero-copy views (offset into input buffer), so decoded bytes can't use offset_of
  • Override stores decoded strings as Vec<u8> in a separate decoded_strs list per field
  • finish_batch merges zero-copy try_append_view entries with decoded append_value entries by row number

Conformance / compliance test updates

  • Removed !actual.contains('\\') guard in oracle comparison — scanner now matches sonic-rs for all strings including those with escapes
  • Updated compliance_special_chars_in_values and compliance_escaped_quotes_in_strings to expect decoded characters
  • Added \uXXXX BMP code points to arb_json_string proptest generator
// Before: raw bytes passed through, double-escaped downstream
builder.append_str_by_idx(idx, &buf[val_start..val_end]);

// After: decode escapes, use appropriate append method
if memchr::memchr(b'\\', raw).is_some() {
    decode_json_escapes(raw, scratch);
    builder.append_decoded_str_by_idx(idx, scratch);
} else {
    builder.append_str_by_idx(idx, raw);
}

Copilot AI linked an issue Apr 4, 2026 that may be closed by this pull request
3 tasks
Copilot AI and others added 2 commits April 4, 2026 03:25
Decode all RFC 8259 §7 escape sequences (\", \\, \/, \b, \f, \n, \r,
\t, \uXXXX including surrogate pairs) during string extraction in
json_scanner. Prevents double-escaping when values are re-serialized
downstream.

Fast path: strings without backslashes pass through zero-copy.
Slow path: decode into a scratch buffer reused across lines.

StreamingBuilder gains append_decoded_str_by_idx for decoded strings
that are not subslices of the input buffer (uses StringViewBuilder's
append_value copy path instead of try_append_view).

Agent-Logs-Url: https://github.com/strawgate/memagent/sessions/4a1bca50-db56-4902-b392-7969d2968d8a

Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix P1 data corruption for JSON escape sequences fix(scanner): decode JSON escape sequences in string values (#410) Apr 4, 2026
Copilot AI requested a review from strawgate April 4, 2026 03:41
@strawgate strawgate marked this pull request as ready for review April 4, 2026 04:02
@strawgate
Copy link
Copy Markdown
Owner

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 4, 2026

Warning

Rate limit exceeded

@strawgate has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 25 minutes and 28 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 25 minutes and 28 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1b3768c9-7299-49f1-a277-a1b9196b477a

📥 Commits

Reviewing files that changed from the base of the PR and between 26cee6e and 880d27f.

📒 Files selected for processing (7)
  • crates/logfwd-arrow/src/scanner.rs
  • crates/logfwd-arrow/src/streaming_builder.rs
  • crates/logfwd-core/src/json_scanner.rs
  • crates/logfwd-core/src/scanner.rs
  • crates/logfwd-core/tests/compliance_data.rs
  • crates/logfwd-core/tests/scanner_conformance.rs
  • crates/logfwd-test-utils/src/json.rs

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

strawgate and others added 2 commits April 3, 2026 23:04
Add `caf` to the typos allowlist — it is a partial string in
`caf\u00e9` (café) used in a unicode escape test, not a misspelling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ue mismatch

The escape decoding change (#410) stored decoded strings in a separate
`decoded_strs` vector per field, while regular strings went into
`str_views`. When duplicate keys beyond index 63 escaped the
`check_dup_bits` guard, the two vectors could hold conflicting entries
for the same row. The `finish_batch` interleaving logic always checked
`str_views` first, causing the second (duplicate) value to win instead
of the first, diverging from StorageBuilder's first-writer-wins
semantics.

Fix: eliminate `decoded_strs` entirely. Decoded bytes are appended to a
builder-level `decoded_buf`, and their offsets (shifted by `buf.len()`)
are stored in the same `str_views` vector as regular strings. At
`finish_batch` time, if any decoded strings exist, a combined Arrow
buffer is created by concatenating the original input buffer with
`decoded_buf`; otherwise the original zero-copy path is preserved.

This ensures insertion order is maintained in a single vector, so
first-writer-wins naturally applies for duplicate keys regardless of
field index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@strawgate strawgate merged commit 999c93a into master Apr 4, 2026
5 of 6 checks passed
@strawgate strawgate deleted the copilot/fix-json-escape-sequence-bug branch April 4, 2026 04:49
strawgate added a commit that referenced this pull request Apr 11, 2026
Systematic audit found 30 potential issues from agent-authored code.
After parallel verification by 11 subagents, 13 were confirmed real,
8 were by-design, 5 were won't-fix, and 5 were false positives.
This commit fixes all 12 actionable confirmed findings.

**High severity:**
- Fix OtherStr panic: OTLP sink crashed on non-string attribute types
  (e.g., hash() UDF returning UInt64). Replaced unreachable!() with
  array_value_to_string(). Removed dead str_value() function. (#7)
- Fix silent struct drop: non-conflict Struct columns now log a warning
  before being skipped, matching the resource struct behavior. (#6)

**Medium severity:**
- Fix scanner contract drift: SCANNER_CONTRACT.md said "no escape
  decoding" but implementation decodes since PR #885. Updated doc. (#19)
- Deduplicate calendar math: made core's Kani-verified days_from_civil
  public; arrow's wrapper now delegates instead of reimplementing. (#21)
- Centralize metadata keys: added METADATA_RESOURCE_KEY and
  METADATA_RESOURCE_PREFIX constants to field_names.rs, replacing 15
  bare string literals across 4 files / 3 crates. (#15)
- Add TypedColumn::Bytes variant: OTAP bytes attributes now round-trip
  as BinaryArray instead of being hex-encoded to strings. (#16)

**Low severity:**
- Deduplicate WELL_KNOWN arrays: star_schema.rs now delegates to
  field_names::matches_any() instead of maintaining a local copy. Added
  logfwd-types dependency to logfwd-arrow. (#13)
- Centralize _raw column name: added field_names::RAW constant. (#12)
- Extract MAX_REQUEST_BODY_SIZE: shared constant in receiver_http.rs
  replaces 3 independent definitions. (#27)
- Import DEFAULT_RETRY_AFTER_SECS: otap_sink and arrow_ipc_sink now
  import from http_classify instead of redefining. (#29)
- Name timing defaults: pipeline build.rs and input_build.rs now use
  named constants instead of inline unwrap_or literals. (#30)
- Add timestamp diagnostic: tracing::debug!() on timestamp parse
  fallback for operator visibility. (#1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

work-unit: logfwd-core/scanner — P1 JSON escape fix

2 participants