Skip to content

Change to use WTF-8/WTF-16 positions instead of abstract cursors#26

Merged
wingo merged 5 commits into
mainfrom
native-utf-8-and-utf-16
Feb 9, 2022
Merged

Change to use WTF-8/WTF-16 positions instead of abstract cursors#26
wingo merged 5 commits into
mainfrom
native-utf-8-and-utf-16

Conversation

@wingo
Copy link
Copy Markdown
Owner

@wingo wingo commented Feb 7, 2022

This patch switches from using implementation-specified i32 cursors to denote positions in strings via WTF-8 and WTF-16 offsets. Using the WTF-8 interface treats strings as sequences of codepoints, whereas the WTF-16 interface treats them as WTF-16 code units. The examples and FAQ are also updated.

See #21, #23, #24.

This patch switches from using implementation-specified i32 cursors to
denote positions in strings via WTF-8 and WTF-16 offsets.  Using the
WTF-8 interface treats strings as sequences of codepoints, whereas the
WTF-16 interface treats them as WTF-16 code units.  The examples and FAQ
are not yet updated, however.
@wingo
Copy link
Copy Markdown
Owner Author

wingo commented Feb 7, 2022

cc @jakobkummerow, @skuzmich

Comment thread README.md Outdated
@skuzmich
Copy link
Copy Markdown

skuzmich commented Feb 7, 2022

Thanks! Looks like updated semantics and instruction set (assuming having GC array <-> stringref conversions in the future) would make stringref a good type for core Kotlin String when targeting embeddings with WTF-16 strings.

However, when targeting WTF-8 embeddings, it is currently unclear if compatibility with host strings would justify cost of non-native stringref.get_wtf16, since proposal mentions breadcrumbs and single-slot cache among acceptable solutions.

Comment thread README.md Outdated
Comment thread README.md Outdated
Comment thread README.md Outdated
wingo and others added 3 commits February 9, 2022 14:05
Co-authored-by: Ms2ger <Ms2ger@gmail.com>
Co-authored-by: Ms2ger <Ms2ger@gmail.com>
@wingo
Copy link
Copy Markdown
Owner Author

wingo commented Feb 9, 2022

OK this is better than the old "it's an i32 but you can't trust it" approach. Let's land and iterate!

@wingo wingo merged commit 6bd5f6d into main Feb 9, 2022
@wingo wingo deleted the native-utf-8-and-utf-16 branch February 9, 2022 14:11
Copy link
Copy Markdown

@jakobkummerow jakobkummerow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments. (Sorry that it took me a while to find sufficient time to review this properly, it's a large diff.)

No objections to having merged it, of course.

Comment thread README.md

Encode the contents of the string *`str`* as UTF-8 or WTF-8,
respectively, starting at the WTF-8 byte offset *`pos`*, to memory at
*`ptr`*, limited to *`bytes`* bytes. Return the number of bytes
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ptr:address parameter was lost in the edit.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Firstly, thank you thank you for the close review! Will follow up in a new PR. Good catch here.

Comment thread README.md
Encode the contents of the string *`str`* as UTF-8 or WTF-8,
respectively, starting at the WTF-8 byte offset *`pos`*, to memory at
*`ptr`*, limited to *`bytes`* bytes. Return the number of bytes
written. Note that no `NUL` terminator is ever written. If any
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to limit the maximum output to 2^31-2 (instead of ...-1) so that callers can add their own NUL byte (if desired) and still remain in int32 range for the overall length?

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that given that you would write the NUL with i32.store_u8 whose memarg operand takes a u32 offset (https://webassembly.github.io/spec/core/syntax/instructions.html#syntax-memarg) that perhaps no change is needed here. Provided that ptr < 2^31, you can always write 2^31-1 bytes and then (i32.store_u8 offset:2^31 alighn:1) relative to pos. LMK if I am misunderstanding the issue though.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also you could pass 2^31-2 as the bytes operand, if that's what you wanted)

Comment thread README.md

In this MVP, it is not possible to produce a string which is not a valid
sequence of unicode scalar values.
All instructions which take string positions trap if the position is not
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that this might be expensive.
IIUC, it's not linear in complexity, because in both WTF-16 and WTF-8 code units can be classified in O(1) as being a non-first element of a surrogate pair/chain or not. Still, there's a cost to doing this check for every access. We might just have to prototype it and see if that cost is acceptable.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Related to #10.

I think that for WTF-16 things are easy -- a valid position is one that's in range. I think we have to check for this already so there's no additional work. Do lmk if I am not thinking straight. Note that the WTF-16 interfaces index over code units and not codepoints -- i.e. there's no special requirement forbidding a position value which is the offset of the low surrogate of a surrogate pair.

For WTF-8, we have the range check (no additional work even on WTF-16 hosts) and then the codepoint check. Agreed that this is a cost. Let's keep thinking about this in #10.

Comment thread README.md
The `string.const` section indicates the literal as an `i32` index into
a new custom section: a string table, encoded as a `vec(vec(u8))` of
valid UTF-8 strings. Because literal strings can contain codepoint 0,
valid WTF-8 strings. Because literal strings can contain codepoint 0,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://simonsapin.github.io/wtf-8/#intended-audience: "WTF-8 must not be used [...] for transmission over the Internet."
Bummer, but arguably finding a viable solution for the constraints we're facing in Wasm outweighs what some other spec did or didn't mean itself to be used for.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!! Filed #29, just so that we can formally decide one way ot the other.

Comment thread README.md
Create a new string from the *`bytes`* WTF-8 bytes in memory at *`ptr`*.
Out-of-bounds access will trap. Attempting to create a string with
invalid WTF-8 will trap. The maximum value for *`bytes`* is
2<sup>31</sup>–1; passing a higher value traps.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Standardizing the max string length (in particular: in the Wasm-JS API spec) might be challenging. In V8, the current situation is that we allow up to (1 << 28) - 16 characters on 32-bit builds and up to (1 << 29) - 24 characters on 64-bit builds. The latter cannot be increased without massive implementation effort; I'm not sure whether the former could be increased to match the latter. I wouldn't be surprised if other browsers similarly had hard constraints defined by their implementation choices. The Wasm-JS API so far has been able to standardize precise, implementation-independent limits for everything -- but so far, no Wasm feature required this kind of tight alignment with existing implementations of JS features and their limits.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related investigation into string lengths: #12 (comment). I was very surprised that JSC allows strings up to 2^31-1 code units!! But OK.

I think the intention here is to allow WebAssembly to address any string that comes from the host. However when it comes to creating strings, unlike pre-GC WebAssembly, there is the possibility of dynamic allocation failure. Any string.new can fail, regardless of the length. This brings an inherent source of non-uniformity into the execution of a WebAssembly program but seems to be essential to the domain. On the other hand dynamic allocation failure can provide cover for implementation-specific limits; see also https://github.com/wingo/stringrefs#stringnew-size-limits. WDYT?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it's not technically all the same: if an attempt to allocate a (small) string fails due to OOM, we will crash the process (not trap). If an allocation attempt fails because of a (spec or implementation defined) limit, we would at least have the option to trap instead. But I don't have a better suggestion that just relying on the generic "any allocation can fail" rule here.

Comment thread README.md
less than the *`codepoints`* parameter if end-of-string or
beginning-of-string would be reached. The *`codepoints`* argument is
interpreted as a `u32`.
### String positions
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering whether we can get away with not having cursors/positions at all, and instead entirely rely on string.encode_*ing the string's contents to memory (or a WasmGC array). Source languages that must offer an implementation of character-at would compile this to the creation of a temporary copy of the string, on which they can then perform whatever encoding-specific cursor manipulations they want. If we assume that character-at is typically used for iterating over (most of) an entire string, then this explicit conversion is probably only slightly more expensive than the currently proposed position-based approach (especially considering position validity checking), so that might be acceptable at least for an MVP. Creating string slices would then also involve dumping the source string to memory and string.new_*ing the desired part back into a string; I don't have a good intuition for how frequent this is.
To be clear, I'm not trying to argue for this change, just putting the idea out there. I'm honestly not sure about its feasibility.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good questions. I think that we certainly need get_wtf16, do you agree? A program in a source language that already expose u16 code unit access can't be easily transformed by a compiler to work on a copy without introducing algorithmic penalties.

I think users of the WTF-8 interfaces either actually want UTF-8, for sending a string somewhere, or are actually treating the string as a sequence of codepoints and need a way to denote a position in a string, be it for indicating which part of a string to slice / copy / compare or to access an individual codepoint. To some source languages, having that position be WTF-8 is almost immaterial -- they really just want codepoints, and the WTF-8 interface is the one that gives codepoints. Some source languages will care though, e.g. Rust which stores string lengths internally as UTF-8 byte lengths, and any source language that needs to encode WTF-8 bytes to memory. Related to #27.

I think we will need to iterate more here...

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define "algorithmic penalty" -- with the scheme I described, a single user-space get_wtf16 replacement would get more expensive (O(n)), but if you assume they usually happen in a loop over the string, then the loop's overall asymptotic complexity would remain O(n).

At any rate, this discussion probably deserves its own issue.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess #31 would be this issue. Thanks for filing that!

Comment thread README.md

```
(string.eq a:stringref a-cursor:i32 b:stringref b-cursor:i32 codepoints:i32) -> i32
(string.eq_wtf8 a:stringref a-pos:i32 b:stringref b-pos:i32 codepoints:i32) -> i32
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be nice to have a (string.eq a:stringref b:stringref), for what's likely the most common case of string comparisons: comparing entire strings. By skipping positions and length, this can be encoding-independent, which would avoid the issue that the encoding-specific variants will likely take a performance penalty on engines on which the respective encoding is foreign.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is equivalent to (string.eq_wtf8 a 0 b 0 -1) or (string.eq_wtf16 a 0 b 0 -1), so it would just be an abbreviation and not imply transcoding. Is that good enough? Something is still not quite right here though, feels weird to have these two interfaces.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what feels weird is that comparing two strings for equality conceptually shouldn't care about their encoding (because the same string could be encoded either way). But since the positions used as starting points have encoding-specific meaning, the current design needs the variants. Theoretically we'd even need at least one mixed variant, but I suppose we assume that wanting to mix wtf-8 and wtf-16 positions (a: assumed-wtf8-stringref, a-pos:wtf8-codepoint-index, b: assumed-wtf16-string, b-pos: wtf16-codeunit-index) won't happen in practice?

Comment thread README.md
also want to implement a map from UTF-16 position to UTF-8 position via
[breadcrumbs](https://www.swift.org/blog/utf8-string/#breadcrumbs).
UTF-8 position validation is ensuring the cursor is less than or equal
to the string byte length, and that `(ptr[cursor] & 0xb0) != 0x80`.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be & 0xc0?

Comment thread README.md

We expect that compilers that emit the WTF-16 interface place more
importance on `string.get_wtf16`. Implementations should ensure that
`string.get_wtf16` runs in near-linear time, even on systems that
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean "near-constant"? Linear time for a single get would be pretty slow.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, right, a thinko on my part -- will fix.

Comment thread README.md
### Could abstract the concept of a string position?

The question is, if we see strings as sequences of codepoints that can
be seeked around in, what if we defined an abstract time for a cursor
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/time/type/, I guess?

@wingo wingo restored the native-utf-8-and-utf-16 branch February 10, 2022 07:59
wingo added a commit that referenced this pull request Feb 17, 2022
This commit fixes some nits after #26 was merged.
@wingo wingo mentioned this pull request Feb 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants