Change to use WTF-8/WTF-16 positions instead of abstract cursors by wingo · Pull Request #26 · wingo/stringrefs

wingo · 2022-02-07T15:55:02Z

This patch switches from implementation-specified i32 cursors to WTF-8 and WTF-16 offsets for denoting positions in strings. The WTF-8 interface treats strings as sequences of codepoints, whereas the WTF-16 interface treats them as sequences of WTF-16 code units. The examples and FAQ are also updated.

See #21, #23, #24.
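To make the two kinds of position concrete, here is a minimal Python sketch (not part of the patch) that lists, for each codepoint of a string, the WTF-8 byte offset and the WTF-16 code unit offset at which it starts. The helper name `codepoint_positions` is hypothetical, and a Python `str` is used only as a stand-in for a stringref.

```python
def codepoint_positions(s):
    """Return (wtf8_offsets, wtf16_offsets) for the start of each codepoint in s."""
    wtf8_offsets, wtf16_offsets = [], []
    byte_pos = unit_pos = 0
    for ch in s:
        wtf8_offsets.append(byte_pos)
        wtf16_offsets.append(unit_pos)
        # "surrogatepass" lets lone surrogates through, as WTF-8 does.
        byte_pos += len(ch.encode("utf-8", "surrogatepass"))
        # Codepoints above U+FFFF take two WTF-16 code units.
        unit_pos += 2 if ord(ch) > 0xFFFF else 1
    return wtf8_offsets, wtf16_offsets


wtf8, wtf16 = codepoint_positions("a\u00e9\U0001F600z")
print(wtf8)   # [0, 1, 3, 7]
print(wtf16)  # [0, 1, 2, 4]
```

The same codepoint therefore has different positions depending on which interface is used, which is why the position-taking instructions come in `_wtf8` and `_wtf16` variants.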

This patch switches from implementation-specified i32 cursors to WTF-8 and WTF-16 offsets for denoting positions in strings. The WTF-8 interface treats strings as sequences of codepoints, whereas the WTF-16 interface treats them as sequences of WTF-16 code units. The examples and FAQ are not yet updated, however.

wingo · 2022-02-07T16:03:12Z

cc @jakobkummerow, @skuzmich

skuzmich · 2022-02-07T17:40:58Z

Thanks! It looks like the updated semantics and instruction set (assuming GC array <-> stringref conversions arrive in the future) would make stringref a good type for the core Kotlin String when targeting embeddings with WTF-16 strings.

However, when targeting WTF-8 embeddings, it is currently unclear whether compatibility with host strings would justify the cost of a non-native string.get_wtf16, since the proposal mentions breadcrumbs and a single-slot cache among acceptable solutions.

Co-authored-by: Ms2ger <Ms2ger@gmail.com>

wingo · 2022-02-09T14:11:43Z

OK this is better than the old "it's an i32 but you can't trust it" approach. Let's land and iterate!

jakobkummerow

A couple of comments. (Sorry that it took me a while to find sufficient time to review this properly, it's a large diff.)

No objections to having merged it, of course.

jakobkummerow · 2022-02-09T21:04:41Z

+
+Encode the contents of the string *`str`* as UTF-8 or WTF-8,
+respectively, starting at the WTF-8 byte offset *`pos`*, to memory at
+*`ptr`*, limited to *`bytes`* bytes.  Return the number of bytes


The ptr:address parameter was lost in the edit.

Firstly, thank you thank you for the close review! Will follow up in a new PR. Good catch here.

jakobkummerow · 2022-02-09T21:05:49Z

+Encode the contents of the string *`str`* as UTF-8 or WTF-8,
+respectively, starting at the WTF-8 byte offset *`pos`*, to memory at
+*`ptr`*, limited to *`bytes`* bytes.  Return the number of bytes
 written.  Note that no `NUL` terminator is ever written.  If any


Would it make sense to limit the maximum output to 2^31-2 (instead of ...-1) so that callers can add their own NUL byte (if desired) and still remain in int32 range for the overall length?

I think that given that you would write the NUL with i32.store8, whose memarg operand takes a u32 offset (https://webassembly.github.io/spec/core/syntax/instructions.html#syntax-memarg), perhaps no change is needed here. Provided that ptr < 2^31, you can always write 2^31-1 bytes and then (i32.store8 offset:2^31-1 align:1) relative to ptr. LMK if I am misunderstanding the issue though.

(also you could pass 2^31-2 as the bytes operand, if that's what you wanted)
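The arithmetic behind this reply can be checked with a small sketch. It assumes a 32-bit linear memory (addresses are u32) and uses the 2^31-1 byte limit quoted above; `nul_address` is a hypothetical helper, not an instruction from the proposal.

```python
MAX_ENCODE_BYTES = 2**31 - 1  # byte limit discussed in this thread
U32_MAX = 2**32 - 1           # largest address in a 32-bit memory (assumption)


def nul_address(ptr, written):
    """Address of the byte just past `written` encoded bytes starting at `ptr`."""
    if not 0 <= written <= MAX_ENCODE_BYTES:
        raise ValueError("written exceeds the encode limit")
    addr = ptr + written
    if addr > U32_MAX:
        raise OverflowError("NUL would not be addressable with a u32")
    return addr


# Worst case: ptr just below 2^31 and the maximum number of bytes written.
print(nul_address(2**31 - 1, MAX_ENCODE_BYTES) <= U32_MAX)  # True
```

So as long as `ptr` stays below 2^31, the terminator's address still fits in a u32 even when the full 2^31-1 bytes are written.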

jakobkummerow · 2022-02-09T21:08:03Z


-In this MVP, it is not possible to produce a string which is not a valid
-sequence of unicode scalar values.
+All instructions which take string positions trap if the position is not


I'm worried that this might be expensive.
IIUC, it's not linear in complexity, because in both WTF-16 and WTF-8 code units can be classified in O(1) as being a non-first element of a surrogate pair/chain or not. Still, there's a cost to doing this check for every access. We might just have to prototype it and see if that cost is acceptable.

Good question! Related to #10.

I think that for WTF-16 things are easy -- a valid position is one that's in range. I think we have to check for this already so there's no additional work. Do lmk if I am not thinking straight. Note that the WTF-16 interfaces index over code units and not codepoints -- i.e. there's no special requirement forbidding a position value which is the offset of the low surrogate of a surrogate pair.

For WTF-8, we have the range check (no additional work even on WTF-16 hosts) and then the codepoint check. Agreed that this is a cost. Let's keep thinking about this in #10.
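As a rough illustration of the two checks described above, here is a Python sketch. It assumes that the end-of-string offset counts as a valid position in both interfaces (the README text quoted later says "less than or equal to the string byte length" for the byte-offset case; the WTF-16 case is an assumption). The function names are hypothetical.

```python
def is_valid_wtf16_pos(units, pos):
    """WTF-16: any in-range code unit offset is valid, even mid surrogate pair."""
    return 0 <= pos <= len(units)


def is_valid_wtf8_pos(data, pos):
    """WTF-8: in range, and not pointing at a continuation byte (0b10xxxxxx)."""
    if not 0 <= pos <= len(data):
        return False
    if pos == len(data):
        return True  # end-of-string position (assumed valid)
    return (data[pos] & 0xC0) != 0x80


data = "a\U0001F600".encode("utf-8")  # 1 byte + 4 bytes
print([p for p in range(len(data) + 2) if is_valid_wtf8_pos(data, p)])  # [0, 1, 5]
```

The WTF-8 check adds one byte load and a mask to the range check, which is the per-access cost under discussion.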

jakobkummerow · 2022-02-09T21:09:26Z

 The `string.const` section indicates the literal as an `i32` index into
 a new custom section: a string table, encoded as a `vec(vec(u8))` of
-valid UTF-8 strings.  Because literal strings can contain codepoint 0,
+valid WTF-8 strings.  Because literal strings can contain codepoint 0,


https://simonsapin.github.io/wtf-8/#intended-audience: "WTF-8 must not be used [...] for transmission over the Internet."
Bummer, but arguably finding a viable solution for the constraints we're facing in Wasm outweighs what some other spec did or didn't mean itself to be used for.

Good catch!! Filed #29, just so that we can formally decide one way or the other.

jakobkummerow · 2022-02-09T21:26:02Z

+Create a new string from the *`bytes`* WTF-8 bytes in memory at *`ptr`*.
+Out-of-bounds access will trap.  Attempting to create a string with
+invalid WTF-8 will trap.  The maximum value for *`bytes`* is
+2<sup>31</sup>–1; passing a higher value traps.


Standardizing the max string length (in particular: in the Wasm-JS API spec) might be challenging. In V8, the current situation is that we allow up to (1 << 28) - 16 characters on 32-bit builds and up to (1 << 29) - 24 characters on 64-bit builds. The latter cannot be increased without massive implementation effort; I'm not sure whether the former could be increased to match the latter. I wouldn't be surprised if other browsers similarly had hard constraints defined by their implementation choices. The Wasm-JS API so far has been able to standardize precise, implementation-independent limits for everything -- but so far, no Wasm feature required this kind of tight alignment with existing implementations of JS features and their limits.

Related investigation into string lengths: #12 (comment). I was very surprised that JSC allows strings up to 2^31-1 code units!! But OK.

I think the intention here is to allow WebAssembly to address any string that comes from the host. However when it comes to creating strings, unlike pre-GC WebAssembly, there is the possibility of dynamic allocation failure. Any string.new can fail, regardless of the length. This brings an inherent source of non-uniformity into the execution of a WebAssembly program but seems to be essential to the domain. On the other hand dynamic allocation failure can provide cover for implementation-specific limits; see also https://github.com/wingo/stringrefs#stringnew-size-limits. WDYT?

Well, it's not technically all the same: if an attempt to allocate a (small) string fails due to OOM, we will crash the process (not trap). If an allocation attempt fails because of a (spec or implementation defined) limit, we would at least have the option to trap instead. But I don't have a better suggestion than just relying on the generic "any allocation can fail" rule here.

jakobkummerow · 2022-02-09T21:28:06Z

-less than the *`codepoints`* parameter if end-of-string or
-beginning-of-string would be reached.  The *`codepoints`* argument is
-interpreted as a `u32`.
+### String positions


I'm wondering whether we can get away with not having cursors/positions at all, and instead entirely rely on string.encode_*ing the string's contents to memory (or a WasmGC array). Source languages that must offer an implementation of character-at would compile this to the creation of a temporary copy of the string, on which they can then perform whatever encoding-specific cursor manipulations they want. If we assume that character-at is typically used for iterating over (most of) an entire string, then this explicit conversion is probably only slightly more expensive than the currently proposed position-based approach (especially considering position validity checking), so that might be acceptable at least for an MVP. Creating string slices would then also involve dumping the source string to memory and string.new_*ing the desired part back into a string; I don't have a good intuition for how frequent this is.
To be clear, I'm not trying to argue for this change, just putting the idea out there. I'm honestly not sure about its feasibility.

Yeah good questions. I think that we certainly need get_wtf16, do you agree? A program in a source language that already exposes u16 code unit access can't be easily transformed by a compiler to work on a copy without introducing algorithmic penalties.

I think users of the WTF-8 interfaces either actually want UTF-8, for sending a string somewhere, or are actually treating the string as a sequence of codepoints and need a way to denote a position in a string, be it for indicating which part of a string to slice / copy / compare or to access an individual codepoint. To some source languages, having that position be WTF-8 is almost immaterial -- they really just want codepoints, and the WTF-8 interface is the one that gives codepoints. Some source languages will care though, e.g. Rust which stores string lengths internally as UTF-8 byte lengths, and any source language that needs to encode WTF-8 bytes to memory. Related to #27.

I think we will need to iterate more here...

Define "algorithmic penalty" -- with the scheme I described, a single user-space get_wtf16 replacement would get more expensive (O(n)), but if you assume they usually happen in a loop over the string, then the loop's overall asymptotic complexity would remain O(n).

At any rate, this discussion probably deserves its own issue.

I guess #31 would be this issue. Thanks for filing that!
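The two sides of this exchange can be sketched in Python. This is only a model: `encode_wtf16` stands in for encoding a string's contents to memory or a GC array, and all function names are hypothetical. Copying once and then indexing keeps a full scan at O(n); a compiler that cannot hoist the copy out of the loop pays O(n) per access, so the same scan becomes O(n^2).

```python
import struct


def encode_wtf16(s):
    """Copy the string out as a list of WTF-16 code units (one O(n) pass)."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def checksum_copy_once(s):
    units = encode_wtf16(s)  # single O(n) copy, then O(1) indexing
    return sum(units[i] for i in range(len(units)))


def get_wtf16_by_copy(s, i):
    return encode_wtf16(s)[i]  # O(n) for a single access


def checksum_copy_per_access(s):
    n = len(encode_wtf16(s))
    return sum(get_wtf16_by_copy(s, i) for i in range(n))  # O(n^2) overall


text = "a\U0001F600b"
print(encode_wtf16(text) == [0x61, 0xD83D, 0xDE00, 0x62])               # True
print(checksum_copy_once(text) == checksum_copy_per_access(text))       # True
```

Whether the copy can be hoisted depends on the source program, which is why random access at arbitrary call sites is the hard case.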

jakobkummerow · 2022-02-09T21:34:59Z


-(string.eq a:stringref a-cursor:i32 b:stringref b-cursor:i32 codepoints:i32) -> i32
+(string.eq_wtf8 a:stringref a-pos:i32 b:stringref b-pos:i32 codepoints:i32) -> i32


It might also be nice to have a (string.eq a:stringref b:stringref), for what's likely the most common case of string comparisons: comparing entire strings. By skipping positions and length, this can be encoding-independent, which would avoid the issue that the encoding-specific variants will likely take a performance penalty on engines on which the respective encoding is foreign.

This is equivalent to (string.eq_wtf8 a 0 b 0 -1) or (string.eq_wtf16 a 0 b 0 -1), so it would just be an abbreviation and not imply transcoding. Is that good enough? Something is still not quite right here though, feels weird to have these two interfaces.

I think what feels weird is that comparing two strings for equality conceptually shouldn't care about their encoding (because the same string could be encoded either way). But since the positions used as starting points have encoding-specific meaning, the current design needs the variants. Theoretically we'd even need at least one mixed variant, but I suppose we assume that wanting to mix wtf-8 and wtf-16 positions (a: assumed-wtf8-stringref, a-pos:wtf8-codepoint-index, b: assumed-wtf16-string, b-pos: wtf16-codeunit-index) won't happen in practice?
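A toy model may help show why whole-string equality needs no encoding while the positional variant does. In this sketch a Python `str` stands in for a stringref, and the semantics of `eq_wtf8` are assumptions: positions are WTF-8 byte offsets that must fall on codepoint boundaries, `codepoints` is read as a u32 (so -1 means "as many as possible"), and comparison stops at end-of-string. None of this is taken from the proposal text beyond the signature quoted above.

```python
def wtf8_pos_to_index(s, pos):
    """Map a WTF-8 byte offset to a codepoint index; raise if not on a boundary."""
    offset = 0
    for i, ch in enumerate(s):
        if offset == pos:
            return i
        offset += len(ch.encode("utf-8", "surrogatepass"))
    if offset == pos:
        return len(s)
    raise ValueError("trap: invalid WTF-8 position")


def eq_wtf8(a, a_pos, b, b_pos, codepoints):
    n = codepoints & 0xFFFFFFFF  # interpret the i32 as a u32 (assumption)
    i = wtf8_pos_to_index(a, a_pos)
    j = wtf8_pos_to_index(b, b_pos)
    return a[i:i + n] == b[j:j + n]


def eq_whole(a, b):
    """Encoding-independent comparison: no positions, so no encoding choice."""
    return a == b


print(eq_wtf8("h\u00e9llo", 0, "h\u00e9llo", 0, -1) == eq_whole("h\u00e9llo", "h\u00e9llo"))  # True
print(eq_wtf8("xh\u00e9", 1, "h\u00e9", 0, 2))  # True
```

The only encoding-specific step is `wtf8_pos_to_index`; once both positions are 0 and the length is unbounded, nothing about the comparison depends on WTF-8 versus WTF-16.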

jakobkummerow · 2022-02-09T21:42:17Z

+also want to implement a map from UTF-16 position to UTF-8 position via
+[breadcrumbs](https://www.swift.org/blog/utf8-string/#breadcrumbs).
+UTF-8 position validation is ensuring the cursor is less than or equal
 to the string byte length, and that `(ptr[cursor] & 0xb0) != 0x80`.


Shouldn't this be & 0xc0?
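A quick way to check the mask question is to enumerate all 256 byte values and compare each mask against the definition of a UTF-8 continuation byte (0x80 through 0xBF, i.e. the top two bits are `10`). The helper names below are hypothetical.

```python
def is_continuation(byte, mask):
    """Classify a byte as a continuation byte using the given mask."""
    return (byte & mask) == 0x80


def misclassified(mask):
    """Bytes for which the mask disagrees with the 0x80..0xBF definition."""
    return [b for b in range(256)
            if is_continuation(b, mask) != (0x80 <= b <= 0xBF)]


print(len(misclassified(0xC0)))  # 0
print(len(misclassified(0xB0)))  # 64
```

With `0xb0`, bit 6 is ignored, so lead bytes such as 0xC3 look like continuation bytes, while real continuation bytes such as 0x90 or 0xA9 look like boundaries. `0xc0` tests exactly the top two bits.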

jakobkummerow · 2022-02-09T21:44:48Z

+
+We expect that compilers that emit the WTF-16 interface place more
+importance on `string.get_wtf16`.  Implementations should ensure that
+`string.get_wtf16` runs in near-linear time, even on systems that


Do you mean "near-constant"? Linear time for a single get would be pretty slow.

Hah, right, a thinko on my part -- will fix.
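For intuition on how a WTF-8 host could keep `string.get_wtf16` near-constant, here is a minimal breadcrumbs sketch in Python. It is an illustration under assumptions, not the scheme of any particular engine: the class name `Breadcrumbs`, the `stride` parameter, and the layout of the crumb table are all hypothetical. Every `stride` code units it records where the enclosing codepoint starts, so a lookup scans at most `stride` code units forward.

```python
def utf8_len(lead):
    """Byte length of the sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def utf16_len(lead):
    """WTF-16 code units for the codepoint starting with this lead byte."""
    return 2 if lead >= 0xF0 else 1


class Breadcrumbs:
    def __init__(self, data, stride=4):
        self.data, self.stride = data, stride
        self.crumbs = []  # crumbs[k] = (byte, unit) of codepoint holding unit k*stride
        byte = unit = 0
        while byte < len(data):
            width = utf16_len(data[byte])
            while len(self.crumbs) * stride < unit + width:
                self.crumbs.append((byte, unit))
            byte += utf8_len(data[byte])
            unit += width
        self.length = unit  # total WTF-16 code units

    def get_wtf16(self, index):
        if not 0 <= index < self.length:
            raise IndexError("trap: position out of range")
        byte, unit = self.crumbs[index // self.stride]
        # Scan forward at most `stride` code units to the enclosing codepoint.
        while unit + utf16_len(self.data[byte]) <= index:
            unit += utf16_len(self.data[byte])
            byte += utf8_len(self.data[byte])
        seq = self.data[byte:byte + utf8_len(self.data[byte])]
        cp = ord(seq.decode("utf-8", "surrogatepass"))
        if cp < 0x10000:
            return cp
        cp -= 0x10000
        return 0xD800 + (cp >> 10) if index == unit else 0xDC00 + (cp & 0x3FF)


crumbs = Breadcrumbs("a\U0001F600b".encode("utf-8"), stride=2)
print([hex(crumbs.get_wtf16(i)) for i in range(crumbs.length)])
# ['0x61', '0xd83d', '0xde00', '0x62']
```

The table costs one entry per `stride` code units, so the stride trades memory for the length of the forward scan.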

jakobkummerow · 2022-02-09T21:45:20Z

+### Could abstract the concept of a string position?
+
+The question is, if we see strings as sequences of codepoints that can
+be seeked around in, what if we defined an abstract time for a cursor


s/time/type/, I guess?

lars-t-hansen reviewed Feb 7, 2022

Update examples and FAQ

eed82fa

Ms2ger approved these changes Feb 9, 2022

wingo and others added 3 commits February 9, 2022 14:05

Update README.md

b4e8629

Co-authored-by: Ms2ger <Ms2ger@gmail.com>

Update README.md

f27c2b1

Co-authored-by: Ms2ger <Ms2ger@gmail.com>

Make get_wtf16 consistent with get_wtf8

6898f1b

wingo merged commit 6bd5f6d into main Feb 9, 2022

wingo deleted the native-utf-8-and-utf-16 branch February 9, 2022 14:11

wingo mentioned this pull request Feb 9, 2022

More expressive string.measure results? #15

Closed

jakobkummerow reviewed Feb 9, 2022

wingo restored the native-utf-8-and-utf-16 branch February 10, 2022 07:59

jakobkummerow mentioned this pull request Feb 10, 2022

Possible simplification: drop all indexed/position-based access for MVP #31

Closed

wingo added a commit that referenced this pull request Feb 17, 2022

Editorial fixes

7c3455a

This commit fixes some nits after #26 was merged.

wingo mentioned this pull request Feb 17, 2022

Editorial fixes #34

Merged
