fix(scanner): decode JSON escape sequences in string values (#410)#885
Conversation
Decode all RFC 8259 §7 escape sequences (\", \\, \/, \b, \f, \n, \r, \t, \uXXXX including surrogate pairs) during string extraction in json_scanner. Prevents double-escaping when values are re-serialized downstream. Fast path: strings without backslashes pass through zero-copy. Slow path: decode into a scratch buffer reused across lines. StreamingBuilder gains append_decoded_str_by_idx for decoded strings that are not subslices of the input buffer (uses StringViewBuilder's append_value copy path instead of try_append_view). Agent-Logs-Url: https://github.com/strawgate/memagent/sessions/4a1bca50-db56-4902-b392-7969d2968d8a Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>
…go fmt Agent-Logs-Url: https://github.com/strawgate/memagent/sessions/4a1bca50-db56-4902-b392-7969d2968d8a Co-authored-by: strawgate <6384545+strawgate@users.noreply.github.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 25 minutes and 28 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Repository YAML (base), Organization UI (inherited) Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (7)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Add `caf` to the typos allowlist — it is a partial string in `caf\u00e9` (café) used in a unicode escape test, not a misspelling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ue mismatch The escape decoding change (#410) stored decoded strings in a separate `decoded_strs` vector per field, while regular strings went into `str_views`. When duplicate keys beyond index 63 escaped the `check_dup_bits` guard, the two vectors could hold conflicting entries for the same row. The `finish_batch` interleaving logic always checked `str_views` first, causing the second (duplicate) value to win instead of the first, diverging from StorageBuilder's first-writer-wins semantics. Fix: eliminate `decoded_strs` entirely. Decoded bytes are appended to a builder-level `decoded_buf`, and their offsets (shifted by `buf.len()`) are stored in the same `str_views` vector as regular strings. At `finish_batch` time, if any decoded strings exist, a combined Arrow buffer is created by concatenating the original input buffer with `decoded_buf`; otherwise the original zero-copy path is preserved. This ensures insertion order is maintained in a single vector, so first-writer-wins naturally applies for duplicate keys regardless of field index. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Systematic audit found 30 potential issues from agent-authored code. After parallel verification by 11 subagents, 13 were confirmed real, 8 were by-design, 5 were won't-fix, and 5 were false positives. This commit fixes all 12 actionable confirmed findings. **High severity:** - Fix OtherStr panic: OTLP sink crashed on non-string attribute types (e.g., hash() UDF returning UInt64). Replaced unreachable!() with array_value_to_string(). Removed dead str_value() function. (#7) - Fix silent struct drop: non-conflict Struct columns now log a warning before being skipped, matching the resource struct behavior. (#6) **Medium severity:** - Fix scanner contract drift: SCANNER_CONTRACT.md said "no escape decoding" but implementation decodes since PR #885. Updated doc. (#19) - Deduplicate calendar math: made core's Kani-verified days_from_civil public; arrow's wrapper now delegates instead of reimplementing. (#21) - Centralize metadata keys: added METADATA_RESOURCE_KEY and METADATA_RESOURCE_PREFIX constants to field_names.rs, replacing 15 bare string literals across 4 files / 3 crates. (#15) - Add TypedColumn::Bytes variant: OTAP bytes attributes now round-trip as BinaryArray instead of being hex-encoded to strings. (#16) **Low severity:** - Deduplicate WELL_KNOWN arrays: star_schema.rs now delegates to field_names::matches_any() instead of maintaining a local copy. Added logfwd-types dependency to logfwd-arrow. (#13) - Centralize _raw column name: added field_names::RAW constant. (#12) - Extract MAX_REQUEST_BODY_SIZE: shared constant in receiver_http.rs replaces 3 independent definitions. (#27) - Import DEFAULT_RETRY_AFTER_SECS: otap_sink and arrow_ipc_sink now import from http_classify instead of redefining. (#29) - Name timing defaults: pipeline build.rs and input_build.rs now use named constants instead of inline unwrap_or literals. (#30) - Add timestamp diagnostic: tracing::debug!() on timestamp parse fallback for operator visibility. (#1) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scanner stored raw JSON escape bytes in string values. The output layer then re-escaped backslashes, turning
\u00e9→\\u00e9— syntactically valid but semantically wrong JSON. Affects all escape sequences:\",\\,\/,\b,\f,\n,\r,\t,\uXXXX.Changes
Scanner escape decoding (
json_scanner.rs)decode_json_escapescovering all RFC 8259 §7 escapes including\uXXXXsurrogate pairsmemchr(b'\\', raw)check — strings without backslashes pass through zero-copy unchangedscan_streamingcall, reused across lines viaclear()ScanBuildertrait (scanner.rs)append_decoded_str_by_idxwith default delegation toappend_str_by_idx— signals that the value is decoded and may not be a subslice of the input bufferStreamingBuilderoverride (streaming_builder.rs)StreamingBuilderuses zero-copy views (offset into input buffer), so decoded bytes can't useoffset_ofVec<u8>in a separatedecoded_strslist per fieldfinish_batchmerges zero-copytry_append_viewentries with decodedappend_valueentries by row numberConformance / compliance test updates
!actual.contains('\\')guard in oracle comparison — scanner now matches sonic-rs for all strings including those with escapescompliance_special_chars_in_valuesandcompliance_escaped_quotes_in_stringsto expect decoded characters\uXXXXBMP code points toarb_json_stringproptest generator