Change to use WTF-8/WTF-16 positions instead of abstract cursors (#26)
This patch switches from implementation-specified i32 cursors to WTF-8 and WTF-16 offsets for denoting positions in strings. The WTF-8 interface treats strings as sequences of codepoints, whereas the WTF-16 interface treats them as sequences of WTF-16 code units. The examples and FAQ are not yet updated, however.
Thanks! Looks like updated semantics and instruction set (assuming having GC array <-> [...]). However, when targeting WTF-8 embeddings, it is currently unclear if compatibility with host strings would justify the cost of non-native [...]
Co-authored-by: Ms2ger <Ms2ger@gmail.com>
OK this is better than the old "it's an i32 but you can't trust it" approach. Let's land and iterate!
jakobkummerow left a comment:
A couple of comments. (Sorry that it took me a while to find sufficient time to review this properly, it's a large diff.)
No objections to having merged it, of course.
> Encode the contents of the string *`str`* as UTF-8 or WTF-8,
> respectively, starting at the WTF-8 byte offset *`pos`*, to memory at
> *`ptr`*, limited to *`bytes`* bytes. Return the number of bytes
The ptr:address parameter was lost in the edit.
Firstly, thank you thank you for the close review! Will follow up in a new PR. Good catch here.
> Encode the contents of the string *`str`* as UTF-8 or WTF-8,
> respectively, starting at the WTF-8 byte offset *`pos`*, to memory at
> *`ptr`*, limited to *`bytes`* bytes. Return the number of bytes
> written. Note that no `NUL` terminator is ever written. If any
Would it make sense to limit the maximum output to 2^31-2 (instead of ...-1) so that callers can add their own NUL byte (if desired) and still remain in int32 range for the overall length?
I think that given that you would write the NUL with `i32.store8`, whose memarg operand takes a u32 offset (https://webassembly.github.io/spec/core/syntax/instructions.html#syntax-memarg), perhaps no change is needed here. Provided that *`ptr`* < 2^31, you can always write 2^31-1 bytes and then `(i32.store8 offset=2^31-1 align=1)` relative to *`ptr`*. LMK if I am misunderstanding the issue though.
(also you could pass 2^31-2 as the bytes operand, if that's what you wanted)
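To make the address arithmetic in this thread concrete, here is a tiny Python sketch (the names `MAX_BYTES` and `nul_address` are invented for illustration; this models linear-memory addresses as plain integers):

```python
# Hypothetical sketch of the NUL-terminator arithmetic discussed above.
MAX_BYTES = 2**31 - 1  # maximum encode output per the draft text

def nul_address(ptr: int, bytes_written: int) -> int:
    """Address where a caller would store its own NUL byte."""
    assert 0 <= ptr < 2**31
    assert 0 <= bytes_written <= MAX_BYTES
    addr = ptr + bytes_written
    # Still representable as a u32 linear-memory address, so a plain
    # byte store (with an appropriate memarg offset) can reach it.
    assert addr < 2**32
    return addr

# Worst case: ptr just under 2^31, a full 2^31-1 bytes written.
print(nul_address(2**31 - 1, MAX_BYTES))  # 4294967294, which is < 2^32
```

So as long as *`ptr`* stays below 2^31, the terminator address never leaves u32 range even with the current 2^31-1 limit.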
> In this MVP, it is not possible to produce a string which is not a valid
> sequence of unicode scalar values.
> All instructions which take string positions trap if the position is not
I'm worried that this might be expensive.
IIUC, it's not linear in complexity, because in both WTF-16 and WTF-8 code units can be classified in O(1) as being a non-first element of a surrogate pair/chain or not. Still, there's a cost to doing this check for every access. We might just have to prototype it and see if that cost is acceptable.
Good question! Related to #10.
I think that for WTF-16 things are easy -- a valid position is one that's in range. I think we have to check for this already so there's no additional work. Do lmk if I am not thinking straight. Note that the WTF-16 interfaces index over code units and not codepoints -- i.e. there's no special requirement forbidding a position value which is the offset of the low surrogate of a surrogate pair.
For WTF-8, we have the range check (no additional work even on WTF-16 hosts) and then the codepoint check. Agreed that this is a cost. Let's keep thinking about this in #10.
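As a concrete illustration of the O(1) checks discussed in this thread, here is a host-side Python sketch (not engine code; the function names are made up):

```python
# WTF-16: any in-range code-unit index is a valid position, including
# the offset of the low surrogate of a surrogate pair.
def wtf16_position_valid(code_units: list, pos: int) -> bool:
    return 0 <= pos <= len(code_units)

# WTF-8: in range, and not pointing into the middle of a multi-byte
# sequence (continuation bytes have the form 0b10xxxxxx).
def wtf8_position_valid(wtf8: bytes, pos: int) -> bool:
    if not 0 <= pos <= len(wtf8):
        return False
    return pos == len(wtf8) or (wtf8[pos] & 0xC0) != 0x80

s = "héllo".encode("utf-8")        # b'h\xc3\xa9llo'
assert wtf8_position_valid(s, 1)   # start of 'é': fine
assert not wtf8_position_valid(s, 2)  # inside 'é': would trap
```

Both predicates are constant-time; the extra cost for WTF-8 is the one continuation-byte test per access.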
> The `string.const` section indicates the literal as an `i32` index into
> a new custom section: a string table, encoded as a `vec(vec(u8))` of
> ~~valid UTF-8 strings~~ valid WTF-8 strings. Because literal strings can contain codepoint 0,
https://simonsapin.github.io/wtf-8/#intended-audience: "WTF-8 must not be used [...] for transmission over the Internet."
Bummer, but arguably finding a viable solution for the constraints we're facing in Wasm outweighs what some other spec did or didn't mean itself to be used for.
Good catch!! Filed #29, just so that we can formally decide one way or the other.
> Create a new string from the *`bytes`* WTF-8 bytes in memory at *`ptr`*.
> Out-of-bounds access will trap. Attempting to create a string with
> invalid WTF-8 will trap. The maximum value for *`bytes`* is
> 2<sup>31</sup>–1; passing a higher value traps.
Standardizing the max string length (in particular: in the Wasm-JS API spec) might be challenging. In V8, the current situation is that we allow up to (1 << 28) - 16 characters on 32-bit builds and up to (1 << 29) - 24 characters on 64-bit builds. The latter cannot be increased without massive implementation effort; I'm not sure whether the former could be increased to match the latter. I wouldn't be surprised if other browsers similarly had hard constraints defined by their implementation choices. The Wasm-JS API so far has been able to standardize precise, implementation-independent limits for everything -- but so far, no Wasm feature required this kind of tight alignment with existing implementations of JS features and their limits.
Related investigation into string lengths: #12 (comment). I was very surprised that JSC allows strings up to 2^31-1 code units!! But OK.
I think the intention here is to allow WebAssembly to address any string that comes from the host. However when it comes to creating strings, unlike pre-GC WebAssembly, there is the possibility of dynamic allocation failure. Any string.new can fail, regardless of the length. This brings an inherent source of non-uniformity into the execution of a WebAssembly program but seems to be essential to the domain. On the other hand dynamic allocation failure can provide cover for implementation-specific limits; see also https://github.com/wingo/stringrefs#stringnew-size-limits. WDYT?
Well, it's not technically all the same: if an attempt to allocate a (small) string fails due to OOM, we will crash the process (not trap). If an allocation attempt fails because of a (spec- or implementation-defined) limit, we would at least have the option to trap instead. But I don't have a better suggestion than just relying on the generic "any allocation can fail" rule here.
> less than the *`codepoints`* parameter if end-of-string or
> beginning-of-string would be reached. The *`codepoints`* argument is
> interpreted as a `u32`.
>
> ### String positions
I'm wondering whether we can get away with not having cursors/positions at all, and instead entirely rely on string.encode_*ing the string's contents to memory (or a WasmGC array). Source languages that must offer an implementation of character-at would compile this to the creation of a temporary copy of the string, on which they can then perform whatever encoding-specific cursor manipulations they want. If we assume that character-at is typically used for iterating over (most of) an entire string, then this explicit conversion is probably only slightly more expensive than the currently proposed position-based approach (especially considering position validity checking), so that might be acceptable at least for an MVP. Creating string slices would then also involve dumping the source string to memory and string.new_*ing the desired part back into a string; I don't have a good intuition for how frequent this is.
To be clear, I'm not trying to argue for this change, just putting the idea out there. I'm honestly not sure about its feasibility.
Yeah, good questions. I think that we certainly need `string.get_wtf16`, do you agree? A program in a source language that already exposes u16 code unit access can't be easily transformed by a compiler to work on a copy without introducing algorithmic penalties.
I think users of the WTF-8 interfaces either actually want UTF-8, for sending a string somewhere, or are actually treating the string as a sequence of codepoints and need a way to denote a position in a string, be it for indicating which part of a string to slice / copy / compare or to access an individual codepoint. To some source languages, having that position be WTF-8 is almost immaterial -- they really just want codepoints, and the WTF-8 interface is the one that gives codepoints. Some source languages will care though, e.g. Rust which stores string lengths internally as UTF-8 byte lengths, and any source language that needs to encode WTF-8 bytes to memory. Related to #27.
I think we will need to iterate more here...
Define "algorithmic penalty" -- with the scheme I described, a single user-space get_wtf16 replacement would get more expensive (O(n)), but if you assume they usually happen in a loop over the string, then the loop's overall asymptotic complexity would remain O(n).
At any rate, this discussion probably deserves its own issue.
I guess #31 would be this issue. Thanks for filing that!
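The copy-then-iterate scheme floated in this thread could look roughly like this in user space: dump the string to a WTF-8 buffer once, then walk codepoints with ordinary cursor arithmetic. This is an illustrative Python model (all names invented); each access is O(1) amortized, and the whole loop is O(n):

```python
def next_codepoint(buf: bytes, pos: int):
    """Decode one codepoint at pos; return (codepoint, next_pos)."""
    b0 = buf[pos]
    if b0 < 0x80:                     # one-byte sequence (ASCII)
        return b0, pos + 1
    elif b0 < 0xE0:                   # two-byte sequence
        n, cp = 2, b0 & 0x1F
    elif b0 < 0xF0:                   # three-byte sequence
        n, cp = 3, b0 & 0x0F
    else:                             # four-byte sequence
        n, cp = 4, b0 & 0x07
    for b in buf[pos + 1 : pos + n]:
        cp = (cp << 6) | (b & 0x3F)   # fold in continuation bytes
    return cp, pos + n

def codepoints(buf: bytes):
    pos, out = 0, []
    while pos < len(buf):             # the whole loop is O(len(buf))
        cp, pos = next_codepoint(buf, pos)
        out.append(cp)
    return out

assert codepoints("héllo".encode("utf-8")) == [0x68, 0xE9, 0x6C, 0x6C, 0x6F]
```

The open question from the thread remains: a *single* random-access get pays O(n) for the copy, which is the case the breadcrumbs-style schemes try to avoid.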
> ```diff
> -(string.eq a:stringref a-cursor:i32 b:stringref b-cursor:i32 codepoints:i32) -> i32
> +(string.eq_wtf8 a:stringref a-pos:i32 b:stringref b-pos:i32 codepoints:i32) -> i32
> ```
It might also be nice to have a (string.eq a:stringref b:stringref), for what's likely the most common case of string comparisons: comparing entire strings. By skipping positions and length, this can be encoding-independent, which would avoid the issue that the encoding-specific variants will likely take a performance penalty on engines on which the respective encoding is foreign.
This is equivalent to `(string.eq_wtf8 a 0 b 0 -1)` or `(string.eq_wtf16 a 0 b 0 -1)`, so it would just be an abbreviation and would not imply transcoding. Is that good enough? Something is still not quite right here, though; it feels weird to have these two interfaces.
I think what feels weird is that comparing two strings for equality conceptually shouldn't care about their encoding (because the same string could be encoded either way). But since the positions used as starting points have encoding-specific meaning, the current design needs the variants. Theoretically we'd even need at least one mixed variant, but I suppose we assume that wanting to mix wtf-8 and wtf-16 positions (a: assumed-wtf8-stringref, a-pos:wtf8-codepoint-index, b: assumed-wtf16-string, b-pos: wtf16-codeunit-index) won't happen in practice?
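The claimed equivalence of full-string comparison under either interface can be made concrete with a small Python model (illustrative only; it represents stringrefs as Python strings and treats the `-1` operand as "rest of string"):

```python
def eq_whole(a: str, b: str) -> bool:
    # Encoding-independent: compare the codepoint sequences directly.
    return a == b

def eq_wtf8_full(a: str, b: str) -> bool:
    # Models (string.eq_wtf8 a 0 b 0 -1): all content from byte offset 0.
    return a.encode("utf-8") == b.encode("utf-8")

def eq_wtf16_full(a: str, b: str) -> bool:
    # Models (string.eq_wtf16 a 0 b 0 -1): all content from code unit 0.
    return a.encode("utf-16-le") == b.encode("utf-16-le")

# Starting from position 0 with unlimited length, all three agree,
# including for strings that are canonically equivalent but not
# codepoint-equal ("é" vs. "e" + combining acute).
for a, b in [("abc", "abc"), ("abc", "abd"), ("é", "e\u0301")]:
    assert eq_whole(a, b) == eq_wtf8_full(a, b) == eq_wtf16_full(a, b)
```

Only when the starting positions are nonzero does the encoding of the position become observable, which is the "mixed variant" problem raised above.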
> also want to implement a map from UTF-16 position to UTF-8 position via
> [breadcrumbs](https://www.swift.org/blog/utf8-string/#breadcrumbs).
> UTF-8 position validation is ensuring the cursor is less than or equal
> to the string byte length, and that `(ptr[cursor] & 0xc0) != 0x80`.
> We expect that compilers that emit the WTF-16 interface place more
> importance on `string.get_wtf16`. Implementations should ensure that
> `string.get_wtf16` runs in near-linear time, even on systems that
Do you mean "near-constant"? Linear time for a single get would be pretty slow.
There was a problem hiding this comment.
Hah, right, a thinko on my part -- will fix.
> ### Could we abstract the concept of a string position?
> The question is, if we see strings as sequences of codepoints that can
> be seeked around in, what if we defined an abstract type for a cursor
This commit fixes some nits after #26 was merged.
This patch switches from implementation-specified i32 cursors to WTF-8 and WTF-16 offsets for denoting positions in strings. The WTF-8 interface treats strings as sequences of codepoints, whereas the WTF-16 interface treats them as sequences of WTF-16 code units. The examples and FAQ are also updated.
See #21, #23, #24.