If cursor validity checks are expensive then we should eliminate them #10

@wingo

@jakobkummerow notes on #9:

I am a bit worried about cursor validity checks, especially checking for cursors pointing at the second half of a surrogate pair. I expect that well-formed, bug-free modules will never use such cursors (so it would be sad if engines were forced to spend lots of CPU cycles on this check), but we do have to specify what happens if a module does create that situation. I think it would be best if we silently treated such a second-half-of-a-surrogate-pair like a lone surrogate (which is probably the behavior that would arise from an implementation that doesn't specifically check for this case).

If indeed cursor validity checks are expensive, then we should consider other designs. To answer this, we need to know:

  • What do cursor validity checks entail?
  • How often do they occur? If checks are both expensive and frequent, we have a problem.
  • What would we lose if we dropped checks? If checks provide little benefit and there are coherent permissive approaches to "invalid" cursors then we should consider them.

What's in a cursor validity check

Consider that you have a browser host and WebAssembly strings are JS strings. To check validity of a cursor, you would need to:

  1. If cursor is > String.length, trap
  2. If cursor is == String.length, then we are done: cursor at end of string
  3. The cursor is in bounds. Check that we're not pointing at the low surrogate of a surrogate pair:
    1. If the string is narrow (one byte per code unit), it cannot contain surrogates, so the cursor is valid and we are done.
    2. Otherwise get the code unit in the string at the cursor.
    3. If (codeUnit & 0xf800) != 0xd800 then it's not a surrogate, and the cursor is valid
    4. Otherwise, the cursor points at a surrogate. If it is a high surrogate then we are done and the cursor is valid. Note, this high surrogate might be part of a pair, or it might be isolated.
    5. The code unit is a low surrogate. If cursor is > 0 and str[cursor-1] is a high surrogate then the cursor is invalid; trap.
    6. Otherwise we have an isolated low surrogate, and the cursor is valid.
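
To make the steps concrete, here is a minimal sketch over a JS string. The names `checkCursor`, `isSurrogateUnit`, and `isHighSurrogateUnit` are made up for illustration, a trap is modeled as a thrown `RangeError`, and the cursor is assumed to be a non-negative integer. Step 3.1 (the narrow-string fast path) is left out because JS does not expose a string's internal representation; an engine would test that first.

```typescript
// Hypothetical helpers, not part of any proposal text.
const isSurrogateUnit = (u: number): boolean => (u & 0xf800) === 0xd800;
const isHighSurrogateUnit = (u: number): boolean => (u & 0xfc00) === 0xd800;

// Returns normally if the cursor is valid; throws where the steps say "trap".
function checkCursor(str: string, cursor: number): void {
  // Step 1: past the end of the string.
  if (cursor > str.length) throw new RangeError("cursor out of bounds");
  // Step 2: cursor at end of string.
  if (cursor === str.length) return;
  // Step 3.2: fetch the code unit under the cursor.
  const codeUnit = str.charCodeAt(cursor);
  // Step 3.3: not a surrogate at all.
  if (!isSurrogateUnit(codeUnit)) return;
  // Step 3.4: high surrogate, paired or isolated.
  if (isHighSurrogateUnit(codeUnit)) return;
  // Step 3.5: low surrogate preceded by a high surrogate.
  if (cursor > 0 && isHighSurrogateUnit(str.charCodeAt(cursor - 1))) {
    throw new RangeError("cursor points at the second half of a surrogate pair");
  }
  // Step 3.6: isolated low surrogate; valid.
}
```

Only cursors that land on a low surrogate ever reach the look-behind in step 3.5.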

How frequent are surrogate checks

The suggestion is to skip all the substeps of 3: any cursor which is in range is valid, whether it points to a non-surrogate, a high surrogate of a surrogate pair, or an isolated surrogate. However, if we consider the semantics of any of the instructions that take cursors, we actually need all checks but 3.iv / 3.v (substeps 4 and 5 of step 3 above) to handle surrogate pairs, and in the high surrogate case we need more steps to handle any subsequent low surrogate.

I am thinking that check overhead is not high, given:

  1. we already have to check for high surrogates to process the current codepoint (whether it is one code unit or two)
  2. once you know you have a surrogate, the cost of the check is mostly borne by cursors that point to low surrogates, and isolated or not these are very rare
  3. we expect users to use multi-codepoint instructions, so validation cost is amortized
  4. narrow strings are common even in non-Latin-1 languages (because of e.g. HTML markup) and don't need these checks at all
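
Point 1 can be illustrated with a sketch of reading one codepoint at a cursor. This is not an instruction from the proposal: `readCodepoint` and `CodepointStep` are hypothetical names, a trap is modeled as a thrown `RangeError`, and returning an isolated surrogate's own code unit value is an assumption made for illustration. The surrogate test and the high-surrogate branch are needed just to decode pairs; the validity check only adds the look-behind in the low-surrogate branch.

```typescript
// Hypothetical result type: the decoded codepoint and the cursor after it.
interface CodepointStep { codepoint: number; next: number }

function readCodepoint(s: string, cursor: number): CodepointStep {
  if (cursor >= s.length) throw new RangeError("cursor out of bounds or at end");
  const u = s.charCodeAt(cursor);
  // Common case: not a surrogate, one code unit is one codepoint.
  if ((u & 0xf800) !== 0xd800) return { codepoint: u, next: cursor + 1 };
  if ((u & 0xfc00) === 0xd800) {
    // High surrogate: decoding has to look ahead for a low surrogate anyway.
    if (cursor + 1 < s.length) {
      const lo = s.charCodeAt(cursor + 1);
      if ((lo & 0xfc00) === 0xdc00) {
        const cp = 0x10000 + ((u - 0xd800) << 10) + (lo - 0xdc00);
        return { codepoint: cp, next: cursor + 2 };
      }
    }
    // Isolated high surrogate (assumption: yield its code unit value).
    return { codepoint: u, next: cursor + 1 };
  }
  // Low surrogate: the only branch where the validity check costs extra.
  if (cursor > 0 && (s.charCodeAt(cursor - 1) & 0xfc00) === 0xd800) {
    throw new RangeError("cursor points at the second half of a surrogate pair");
  }
  // Isolated low surrogate (assumption: yield its code unit value).
  return { codepoint: u, next: cursor + 1 };
}
```

Dropping the check would mean deleting the look-behind and letting the last `return` handle every low surrogate, which is the "treat it like a lone surrogate" behavior suggested in the quoted comment.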

What would we lose if we relaxed validity checks

Given that we can define a meaning for in-range but invalid cursors on JS strings, what do checks buy us?

I think the main answer is consistency across the platform. Probably you would want the string API to behave the same for a given USV sequence, whether the host used UTF-16, UTF-8, or whatever. I assume we would want a host that uses UTF-8 to trap if the cursor is in bounds but not at the start of a codepoint. It would be surprising if traversing the same string in the same way on a UTF-16 host would produce a different result.
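
For comparison, here is a sketch of the analogous check on a host that stores strings as UTF-8. The function name `checkUtf8Cursor` is hypothetical, the bytes are assumed to be well-formed UTF-8 (a real host might store WTF-8 to represent isolated surrogates), and a trap is again modeled as a thrown `RangeError`. It relies on the fact that UTF-8 continuation bytes always have the bit pattern `10xxxxxx`.

```typescript
// Hypothetical UTF-8 counterpart of the cursor check; cursor is a byte offset.
function checkUtf8Cursor(bytes: Uint8Array, cursor: number): void {
  if (cursor > bytes.length) throw new RangeError("cursor out of bounds");
  // Cursor at end of string is valid.
  if (cursor === bytes.length) return;
  const byte = bytes[cursor] as number;
  // A continuation byte (0b10xxxxxx) means we are inside a codepoint.
  if ((byte & 0xc0) === 0x80) {
    throw new RangeError("cursor is not at the start of a codepoint");
  }
}
```

For example, with the bytes `0x61 0xc3 0xa9` ("aé"), cursors 0, 1, and 3 pass, while cursor 2 lands on the continuation byte of "é" and traps, much as a cursor at the low half of a surrogate pair traps on a UTF-16 host.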

Conclusion

From what I can see, the cost is minimal and the benefit is small but real, so I would be inclined to keep the checks. If either of these premises is false, we should certainly remove them.
