Properly handle UTF-8 labels #117
Open
Labels
acceptance: go ahead (Reviewed, implementation can start)
area: app (Improvements in overall CLI app usability)
type: feature (New feature or request)
Status: In Progress
Is your feature request related to a problem? Please describe.
The engine currently violates the JSON spec by not normalizing Unicode escapes. We do this for performance purposes, since ordinal comparison can be easily SIMDified, but it's not correct.
For a simple example, the Unicode code point for the letter "a" is U+0061. These JSONs are equivalent under RFC 8259:
    {"a":42}
    {"\u0061":42}

Therefore the query

    $["a"]

should in both cases match the value 42.

Quite sensibly, and indeed officially under the current JSONPath RFC Draft, the queries $["a"] and $["\u0061"] must also be equivalent. All four combinations of the two documents above and the two queries must yield the same result: the value 42.

Describe the solution you'd like
The tradeoff here is important. We expect the difference in performance to be staggering, especially since the head-skip optimisation is by design incompatible with this. We need a flag that will toggle this behaviour. I propose we make this the opt-in behaviour: we expect the vast majority of labels to be ASCII, and a user who wants to match Unicode labels can enable the flag.
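To make the proposed toggle concrete, here is a minimal sketch in Python of the two comparison modes. It is an illustration only, not the engine's implementation: the names `decode_label`, `labels_match` and the `normalize` parameter are hypothetical, and it assumes both labels are given as the raw contents of a double-quoted JSON string (without the surrounding quotes), so that JSON string decoding can stand in for escape normalization.

```python
import json


def decode_label(raw: str) -> str:
    """Resolve JSON escapes (e.g. \\u0061) in the raw contents of a JSON string.

    Assumption: `raw` is valid JSON string content without the enclosing quotes.
    """
    return json.loads('"' + raw + '"')


def labels_match(query_label: str, document_label: str, normalize: bool = False) -> bool:
    """Compare a query label with a document label.

    `normalize` is a hypothetical stand-in for the proposed flag:
    - False: compare the raw bytes as written (the fast, ordinal comparison).
    - True:  decode escapes on both sides first, then compare.
    """
    if not normalize:
        return query_label.encode("utf-8") == document_label.encode("utf-8")
    return decode_label(query_label) == decode_label(document_label)


if __name__ == "__main__":
    raw_labels = ["a", "\\u0061"]  # two spellings of the same label
    for query in raw_labels:
        for document in raw_labels:
            fast = labels_match(query, document)
            exact = labels_match(query, document, normalize=True)
            print(f"query={query!r} document={document!r} fast={fast} normalized={exact}")
```

With `normalize=True` all four combinations of the two spellings match, as the spec requires; with the default raw comparison only the identically spelled pairs do, which is the current behaviour described above.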