Properly handle UTF-8 labels #117

**Is your feature request related to a problem? Please describe.**
The engine currently violates the JSON spec by not normalizing Unicode escapes in labels. We do this for performance, since byte-wise (ordinal) comparison is easy to vectorize with SIMD, but it is not correct.

For a simple example, the Unicode code point for the letter "a" is U+0061 (encoded in UTF-8 as the single byte 0x61). These JSON documents _are_ equivalent under [RFC 8259](https://www.rfc-editor.org/rfc/rfc8259#section-8.2):

```json
{"a":42}
```
```json
{"\u0061":42}
```

Therefore the query `$["a"]` should **in both cases** match the value `42`.

Quite sensibly, and indeed officially under the current [JSONPath RFC Draft](https://www.ietf.org/archive/id/draft-ietf-jsonpath-base-12.html#name-overview), the queries `$["a"]` and `$["\u0061"]` must also be equivalent. All four combinations of the two documents above and the two queries must yield the same result -- the value `42`.
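As a rough illustration of the expected semantics (not of how the engine would implement it), here is a minimal Python sketch that decodes escapes before comparing labels. The helper names `normalize_label` and `labels_equal` are hypothetical, and the standard `json` module stands in for a real unescaping routine.

```python
import json


def normalize_label(raw: str) -> str:
    """Decode JSON escapes in a raw string body (the text between the quotes)."""
    # Hypothetical helper: re-wrap the body in quotes and let the JSON
    # parser resolve escapes such as \u0061.
    return json.loads('"' + raw + '"')


def labels_equal(query_label: str, doc_label: str) -> bool:
    """Compare two raw labels after normalizing their escapes."""
    return normalize_label(query_label) == normalize_label(doc_label)


raw_labels = ["a", r"\u0061"]

# All four query/document combinations should agree.
results = [labels_equal(q, d) for q in raw_labels for d in raw_labels]

# A plain comparison of the raw text, by contrast, treats them as different.
raw_comparison = raw_labels[0] == raw_labels[1]
```

Here `results` holds one boolean per combination, while `raw_comparison` shows what a comparison without normalization sees for the escaped and unescaped spellings.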

**Describe the solution you'd like**
The tradeoff here is important. We expect the difference in performance to be staggering, especially since the `head-skip` optimisation is by design incompatible with this. We need a flag that will toggle this behaviour. I propose we make this the **optional** behaviour: we expect the vast majority of labels to be ASCII, and a user who wants to match Unicode escapes can enable the flag.
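A sketch of how such a toggle could behave, assuming a hypothetical `normalize_unicode` parameter that defaults to off (the name and the default are illustrative, not an existing option of the engine):

```python
import json


def label_matches(query_raw: str, doc_raw: str, normalize_unicode: bool = False) -> bool:
    """Match two raw labels, optionally normalizing JSON escapes first."""
    if not normalize_unicode:
        # Fast path: plain byte-wise comparison, the current behaviour.
        return query_raw.encode("utf-8") == doc_raw.encode("utf-8")
    # Slow path: decode escapes on both sides, then compare.
    return json.loads(f'"{query_raw}"') == json.loads(f'"{doc_raw}"')


fast = label_matches("a", r"\u0061")
correct = label_matches("a", r"\u0061", normalize_unicode=True)
```

With the flag off, the escaped and unescaped spellings do not match; with it on, they do. Labels that contain no escapes give the same answer on both paths, which is why keeping the fast path as the default costs nothing for them.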
