
Change representation of string cursors to just i32 #9

Merged: wingo merged 1 commit into main from no-cursors on Oct 26, 2021


@wingo wingo commented Oct 22, 2021

See the discussion in #6.

Fixes #6 and #8.

@wingo wingo linked an issue Oct 22, 2021 that may be closed by this pull request

wingo commented Oct 25, 2021

A couple questions still to address here:

  • Nondeterminism. We should avoid creating a situation where a WebAssembly program will behave differently on $browser1 versus $browser2. Should we go further than encouraging embeddings to define cursor meanings, and formally require them to do so?
  • Programmer sloppiness. With opaque cursors, the set of valid cursors is only discoverable via string.advance / string.rewind / string.encode, and this behavior doesn't depend on the contents of the string. With i32 cursors, you could just increment the cursor from 0 and it would work for ASCII strings, for the suggested meanings of cursors (code units or byte offset), but trap for strings with codepoints above some limit. Need to discuss this.
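
To make the sloppiness concern concrete, here is a minimal Python sketch. It assumes one of the suggested cursor meanings (a byte offset into a UTF-8 encoding) and models a trap as an exception; `Trap`, `is_boundary`, and `naive_walk` are hypothetical names for illustration, not part of the proposal.

```python
class Trap(Exception):
    """Stands in for a WebAssembly trap in this sketch."""


def is_boundary(data: bytes, cursor: int) -> bool:
    """True if cursor sits at the start of a codepoint or at the end of data."""
    if cursor < 0 or cursor > len(data):
        return False
    if cursor == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 0b10xxxxxx.
    return (data[cursor] & 0xC0) != 0x80


def naive_walk(data: bytes) -> list:
    """Increment a cursor from 0 by one each step, trapping on the first invalid value."""
    visited = []
    for cursor in range(len(data) + 1):
        if not is_boundary(data, cursor):
            raise Trap(f"cursor {cursor} points inside a codepoint")
        visited.append(cursor)
    return visited
```

With this model, `naive_walk(b"abc")` visits every cursor from 0 to 3, while `naive_walk("é".encode("utf-8"))` raises `Trap` at cursor 1, because the second byte of the two-byte sequence is a continuation byte.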

jakobkummerow commented

I think I prefer this over the initial version, because it makes things more explicit, and allows for simpler engines.

To get opacity of cursor values, an option would be to introduce an opaque type for them, which (contrary to the "stringcursor is a 3-tuple" approach) is designed such that engines can implement it by using a plain integer under the hood; in other words: stick with the unboxing currently drafted here, but replace a few occurrences of i32 with stringcursor. (As a hacky alternative that gets by without a new type, externref could be used as cursor type -- it wouldn't actually be a reference, but it would be opaque.) I'm not sure whether that's worth it though: Wasm is intended to be tool-generated, not human-written, so we don't really need guardrails against human coding errors; there are plenty of existing precedents of plain i32 values that have specific semantic meaning in the context where they're used, e.g. performing random i32 arithmetic on memory offsets typically doesn't make much sense either.
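
As a rough analogy for an opaque type that is still a plain integer under the hood, here is a Python sketch using `typing.NewType`. It assumes cursors count UTF-16 code units; `StringCursor`, `start`, and `advance` are hypothetical names, and the opacity is enforced only by a static type checker, not at run time.

```python
from typing import NewType, Sequence

# Distinct type for static checkers; at run time a StringCursor is just an int.
StringCursor = NewType("StringCursor", int)


def start() -> StringCursor:
    """Cursor at the beginning of a string."""
    return StringCursor(0)


def advance(units: Sequence[int], cursor: StringCursor) -> StringCursor:
    """Move forward by one codepoint over a sequence of UTF-16 code units."""
    if cursor >= len(units):
        return cursor  # already at the end
    step = 1
    is_high = 0xD800 <= units[cursor] <= 0xDBFF
    if is_high and cursor + 1 < len(units) and 0xDC00 <= units[cursor + 1] <= 0xDFFF:
        step = 2  # skip both halves of a surrogate pair
    return StringCursor(cursor + step)
```

A type checker would flag `advance(units, 1)` because a bare `int` is not a `StringCursor`, yet no boxing happens when the code runs.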

The string.measure instruction, instead of returning a pair bytes:i32, valid:i32 which is either (num_bytes, 1) or (0, 0), could also return a single bytes:i32 where -1 indicates "invalid". Benefit: slightly simpler; disadvantage: makes it impossible to handle strings taking more than 2 GiB (whereas with the pair, the bytes value could be interpreted as u32, allowing up to 4 GiB strings; though it remains to be seen whether other constraints will get in the way, ruling that out anyway).
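
The two result shapes can be compared in a small Python sketch. It assumes that the single-result form is read as a signed i32, and it uses `None` to stand for a string that cannot be encoded validly; `measure_pair` and `measure_sentinel` are hypothetical names.

```python
from typing import Optional, Tuple

I32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer
U32_MAX = 2**32 - 1  # largest value of an unsigned 32-bit integer


def measure_pair(num_bytes: Optional[int]) -> Tuple[int, int]:
    """Pair form: (num_bytes, 1) on success, (0, 0) when invalid."""
    if num_bytes is None:
        return (0, 0)
    if num_bytes > U32_MAX:
        raise OverflowError("length does not fit in a u32")
    return (num_bytes, 1)


def measure_sentinel(num_bytes: Optional[int]) -> int:
    """Single-result form: the byte count, or -1 when invalid."""
    if num_bytes is None:
        return -1
    if num_bytes > I32_MAX:
        # Assumption: negative results are reserved for the sentinel.
        raise OverflowError("length does not fit in a non-negative i32")
    return num_bytes
```

For a 3 GiB string, `measure_pair(3 * 2**30)` still succeeds, whereas `measure_sentinel(3 * 2**30)` has no representable result under this assumption.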

I am a bit worried about cursor validity checks, especially checking for cursors pointing at the second half of a surrogate pair. I expect that well-formed, bug-free modules will never use such cursors (so it would be sad if engines were forced to spend lots of CPU cycles on this check), but we do have to specify what happens if a module does create that situation. I think it would be best if we silently treated such a second-half-of-a-surrogate-pair like a lone surrogate (which is probably the behavior that would arise from an implementation that doesn't specifically check for this case).
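
The behavior suggested here can be sketched in Python, assuming cursors index UTF-16 code units. The hypothetical `codepoint_at` below performs no check for "cursor at the second half of a pair"; a trailing surrogate simply falls through and is returned as if it were a lone surrogate.

```python
def utf16_units(text: str) -> list:
    """Split a Python string into its UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def codepoint_at(units: list, cursor: int) -> int:
    """Read the codepoint at cursor without validating the cursor position."""
    unit = units[cursor]
    if 0xD800 <= unit <= 0xDBFF and cursor + 1 < len(units):
        low = units[cursor + 1]
        if 0xDC00 <= low <= 0xDFFF:
            # Well-formed pair: combine both halves into one codepoint.
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    # Anything else, including a trailing surrogate, is returned as-is.
    return unit
```

For U+1F600, whose code units are `0xD83D, 0xDE00`, a cursor of 0 yields `0x1F600` and a cursor of 1 yields the lone surrogate `0xDE00`, with no extra validity check on the fast path.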


wingo commented Oct 26, 2021

Thanks for the feedback, @jakobkummerow! I think given the general OK, I will take the opaque-cursor, string-measure, and validity-check-cost questions to separate issues -- they certainly need a good answer!

Linked issues that merging this pull request may close:

  • Provide "current codepoint" accessor?
  • Can we avoid string cursors?
