diff --git a/README.md b/README.md
index 198d539..91d8386 100644
--- a/README.md
+++ b/README.md
@@ -26,18 +26,21 @@ find good compromises are "minimal" and "viable".
4. Allow string literals in element sections
## Definitions
- - *codepoint*: an integer in the range [0,0x10FFFF].
- - *surrogate*: a codepoint in the range [0xD800,0xDFFF].
- - *unicode scalar value*: a codepoint that is not a surrogate.
- - *character*: an imprecise concept that we try to avoid in this
+ - *codepoint*: An integer in the range [0,0x10FFFF].
+ - *surrogate*: A codepoint in the range [0xD800,0xDFFF].
+ - *unicode scalar value*: A codepoint that is not a surrogate.
+ - *character*: An imprecise concept that we try to avoid in this
document.
- - *code unit*: a codepoint in the range [0,0xFFFF].
- - *high surrogate*: a surrogate in the range [0xD800,0xDBFF].
- - *low surrogate*: a surrogate which is not a high surrogate.
- - *surrogate pair*: a sequence of a *high surrogate* followed by a *low
+ - *code unit*: An indivisible unit of an encoded unicode scalar value.
+ For UTF-8 encodings, an integer in the range [0,0xFF] (a byte); for
+ UTF-16 encodings, an integer in the range [0,0xFFFF]; for UTF-32,
+ the unicode scalar value itself.
+ - *high surrogate*: A surrogate in the range [0xD800,0xDBFF].
+ - *low surrogate*: A surrogate which is not a high surrogate.
+ - *surrogate pair*: A sequence of a *high surrogate* followed by a *low
surrogate*, used by UTF-16 to encode a codepoint in the range
[0x10000,0x10FFFF].
- - *isolated surrogate*: any surrogate which is not part of a surrogate
+ - *isolated surrogate*: Any surrogate which is not part of a surrogate
pair.
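
The surrogate-pair mapping that UTF-16 uses for codepoints above 0xFFFF can be sketched in a few lines of Python (illustrative only; not part of the proposal):

```python
def to_surrogate_pair(codepoint):
    """Encode a codepoint in [0x10000, 0x10FFFF] as a UTF-16
    surrogate pair (high surrogate, low surrogate)."""
    assert 0x10000 <= codepoint <= 0x10FFFF
    v = codepoint - 0x10000
    high = 0xD800 + (v >> 10)    # high surrogate, in [0xD800, 0xDBFF]
    low = 0xDC00 + (v & 0x3FF)   # low surrogate, in [0xDC00, 0xDFFF]
    return high, low

# U+1F600 (an emoji) encodes as the pair (0xD83D, 0xDE00) in UTF-16.
print(tuple(map(hex, to_surrogate_pair(0x1F600))))  # → ('0xd83d', '0xde00')
```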
## Design
@@ -52,20 +55,14 @@ of implications.
JS strings are immutable, so WebAssembly strings should also be
immutable.
-#### No `get-char-at` method
-
-A JavaScript string stores its length in code units, not unicode scalar
-values. In the general case, getting the *n*th USV from a string
-requires parsing all preceding code units. We would not want to design
-an API that would encourage straightforward uses (e.g. looping over
-characters) to run in quadratic time.
-
#### Polymorphism
JS engines typically represent strings in many different ways: strings
-which are "narrow" (one byte per code unit) or "wide" (two bytes), rope
-strings or not, string slices or not, and external or not. That's at
-least 16 different kinds of strings.
+which are "narrow" (in which all code units are in [0,0xFF] and can use
+a fixed-width encoding with only one byte per code unit) or "wide" (two
+bytes per code unit, 1 or 2 code units per codepoint), rope strings or
+not, string slices or not, and external or not. That's at least 16
+different kinds of strings.
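
To make the narrow/wide distinction concrete: a string can use a one-byte-per-code-unit representation only when every code unit fits in a byte. A hypothetical check (Python, for illustration; `can_be_narrow` is not a real engine API):

```python
def can_be_narrow(s):
    # A "narrow" representation stores one byte per code unit, which
    # only works if every code unit is in [0, 0xFF]. Any codepoint
    # above 0xFF (including anything needing a surrogate pair) forces
    # the "wide" two-byte-per-code-unit representation.
    return all(ord(c) <= 0xFF for c in s)

print(can_be_narrow("hello"), can_be_narrow("héllo"), can_be_narrow("😀"))
# → True True False
```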
JavaScript can mitigate this polymorphism to a degree via adaptive
compilation, which can devirtualize based on the kind of strings seen at
@@ -136,16 +133,28 @@ If we were just considering simplicity, the best solution would be to
say "strings are sequences of unicode scalar values", but we know that
for JavaScript this is not the case.
-However, we think we can get closer to the simple solution by only
-including interfaces that treat strings as USV sequences, for example by
-only including routines that access string contents in terms of UTF-8,
-UTF-16, and other valid Unicode encodings that exclude isolated
-surrogates by construction. Isolated surrogates are rare in JavaScript
-and the tail should not wag the dog.
+However, we think we can get closer to the simple solution by primarily
+working in terms of unicode scalar values. WebAssembly on its own
+should not be able to create strings with isolated surrogates, and
+therefore we should only include support for reading and writing
+standard Unicode encoding schemes which exclude isolated surrogates by
+construction, for example UTF-8 and UTF-16. Isolated surrogates are
+rare in JavaScript and the tail should not wag the dog.
+
+Such problematic strings can come from a host, however, and where it is
+as simple to define a behavior as to require an implementation to trap,
+we will lean towards defined non-trapping behavior. The proposal also
+leaves the door open to add interfaces that access string contents using
+more general encoding schemes such as WTF-8 if needed in the future.
+
+#### No `get-char-at` method
-It could be that we're wrong, though, and so we'd need to leave the door
-open to add interfaces that access string contents using more general
-encodings such as WTF-8.
+A JavaScript string is composed of a sequence of 16-bit code units which
+encode a sequence of codepoints, in which each codepoint corresponds to
+1 or 2 code units. In the general case, getting the *n*th unicode
+scalar value from a string requires parsing all preceding code units.
+We would not want to design an API that would encourage straightforward
+uses (e.g. looping over unicode scalar values) to run in quadratic time.
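
A sketch of why this lookup is O(n), assuming a UTF-16 representation (illustrative Python; `usv_at` is not a proposed instruction):

```python
def usv_at(code_units, n):
    """Return the n-th unicode scalar value encoded by a sequence of
    UTF-16 code units. O(n): every preceding code unit must be parsed
    to know whether it is half of a surrogate pair."""
    i = 0
    while i < len(code_units):
        u = code_units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(code_units)
                and 0xDC00 <= code_units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one codepoint,
            # two code units.
            cp = 0x10000 + ((u - 0xD800) << 10) + (code_units[i + 1] - 0xDC00)
            i += 2
        else:
            cp = u  # one code unit per codepoint (isolated surrogates too)
            i += 1
        if n == 0:
            return cp
        n -= 1
    raise IndexError(n)

units = [0xD83D, 0xDE00, 0x0061]  # an emoji plus "a", as UTF-16 code units
print(hex(usv_at(units, 0)), hex(usv_at(units, 1)))  # → 0x1f600 0x61
```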
### Oracle: JS engine C++ API
@@ -306,11 +315,30 @@ value of 0 denotes the string start (before the first USV). It follows
that 0 may also denote the end of the string also, for zero-length
strings.
-The specific mapping from USV offset to cursor value is
-implementation-defined. It may be that specific embeddings may give
-cursor values a meaning: for example, a JS embedding may specify that a
-cursor is a code unit offset, or a UTF-8-only embedding may use byte
-offsets.
+A cursor value is an offset into a string. Cursors uniquely identify a
+position in a string and are ordered, so they can be compared with each
+other.
+
+If a host represents strings internally using UTF-8, UTF-16, or UTF-32,
+or a variant thereof such as the one used in JavaScript, cursor values
+are code unit offsets.
+
+For example, because JavaScript hosts have to represent strings as
+(logical) sequences of 16-bit code units, a WebAssembly string cursor
+for a WebAssembly implementation embedded in a web browser will be a
+code unit offset (the same as the operand to JavaScript's
+[`String.prototype.charCodeAt`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charCodeAt)).
+
+For a WebAssembly implementation that represents strings as UTF-8
+internally, cursor values are byte offsets. The intention is that
+accessing content in a string with a cursor has the least possible
+overhead. This also allows hosts to communicate string positions to
+WebAssembly programs.
+
+If a host does not represent strings using a unicode encoding scheme,
+the specific mapping from USV offset to cursor value is
+implementation-defined, with the requirement that values be ordered and
+that each position must have one and only one cursor value.
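
To make the encoding dependence concrete: the same logical position generally has different cursor values on a UTF-8 host and a UTF-16 host. A small illustration in Python:

```python
prefix = "hé"  # "é" is one UTF-16 code unit but two UTF-8 bytes
# Cursor for the position just after "é":
utf8_cursor = len(prefix.encode("utf-8"))            # byte offset
utf16_cursor = len(prefix.encode("utf-16-le")) // 2  # code unit offset
print(utf8_cursor, utf16_cursor)  # → 3 2
```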
To move a cursor to a new position, use the `string.advance` and
`string.rewind` seek instructions.
@@ -356,8 +384,8 @@ trap.
Given that there must be a distinct cursor value for each codepoint in a
string, and one for the end, this constrains the strings that are
-processed by to a maximum of 2<sup>31</sup>–1 codepoints. Not all
-strings have packed cursor values, so the codepoint size limit in
+processed to a maximum of 2<sup>31</sup>–1 code units. Not all
+strings have one code unit per codepoint, so the codepoint size limit in
practice may be lower for any given string. This specification can
therefore represent codepoint counts with an `i32` without risk of
overflow.
@@ -369,8 +397,8 @@ may allow for 64-bit variants of the cursor-using instructions, which
could relax these restrictions.
In practice, no web browser embedding allows for strings longer than
-2<sup>31</sup>–1 code units, so no string from JavaScript is out of
-range for the instructions in this proposal.
+2<sup>31</sup>–1 UTF-16 code units, so no string from JavaScript is too
+large for the instructions in this proposal.
### Accessing string contents
@@ -690,43 +718,76 @@ https://github.com/guybedford/proposal-is-usv-string
Assuming that the non-browser implementation uses UTF-8 as the native
string representation, then a stringref is a pointer, a length, and a
-reference count. A cursor is a byte offset into the string. Cursor
-validation is ensuring the cursor is less than or equal to the string
-byte length, and that `(ptr[cursor] & 0x80) == 0`. Measuring UTF-8
-encodings is just length minus the cursor. Measuring UTF-16 would be
-via a callout. Encoding UTF-8 is just `memcpy`.
+reference count. The specification requires that cursor values be UTF-8
+code unit offsets, which are byte offsets from the beginning of the
+string. Cursor validation is ensuring the cursor is less than or equal
+to the string byte length, and that `(ptr[cursor] & 0xc0) != 0x80`.
+Measuring UTF-8 encodings is just length minus the cursor. Measuring
+UTF-16 would be via a callout. Encoding UTF-8 is just `memcpy`.
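
Cursor validation for a UTF-8 representation checks that the cursor does not land on a continuation byte; a minimal sketch (Python for illustration; a real implementation would do this in the engine):

```python
def valid_cursor(buf, cursor):
    """A cursor into a UTF-8 string is valid if it is at most the byte
    length and does not point into the middle of a multi-byte sequence
    (UTF-8 continuation bytes have the form 0b10xxxxxx)."""
    if cursor > len(buf):
        return False
    return cursor == len(buf) or (buf[cursor] & 0xC0) != 0x80

buf = "é!".encode("utf-8")  # b'\xc3\xa9!'
print([valid_cursor(buf, i) for i in range(4)])
# → [True, False, True, True]
```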
### What's the expected implementation in web browsers?
-We expect that web browsers use JS strings as `stringref`. We expect a
-cursor to be a code unit offset. Seeking, measuring, encoding, and
-equality predicates would call out to run-time functions that would
-dispatch over polymorphic values. The support for one-byte encodings
-may prove to be a performance benefit also.
-
-### The meaning of string cursor values is implementation-defined?!?
-
-It's a compromise. We really want to support efficient access to
-JavaScript strings, but we don't want to expose the idea of iterating
-JavaScript strings in terms of code units, because of non-web
-embeddings. So we iterate instead in terms of unicode scalar values,
-with defined exceptional behavior if there are isolated surrogates. But
-mapping USV offset to code unit offset is O(n) -- so we need cursors
-somehow to avoid quadratic algorithms.
-
-The question is to answer is, should we expose cursors as i32 values
-that the user can see, or keep them opaque in a new type? Opaque
-cursors could avoid checks in some cases and would hide
-implementation-defined differences. But if strings are processed in
-chunks, cursor validity check overhead is likely to be low relative to
-overall program time. So it's just a tradeoff then between exposing
-implementation-defined cursor values, versus the cognitive/spec overhead
-of having to define an opaque cursor type. After going back and forth
-on this multiple times, it would seem that i32 cursors are a workable
-local maximum.
-
-See https://github.com/wingo/wasm-strings/issues/6 for a full
-discussion.
+We expect that web browsers use JS strings as `stringref`. The
+specification then requires that cursor values be UTF-16 code unit
+offsets. Seeking, measuring, encoding, and equality predicates would
+likely call out to run-time functions that would dispatch over
+polymorphic values. The support for one-byte encodings may prove to be
+a performance benefit also.
+
+### Why define string cursors in terms of the host's string representation?
+
+The purpose of a string cursor is to allow efficient access to string
+contents, starting at a specific position.
+
+Under the hood, string cursors must relate to host string
+representation. For example, we really want to support efficient access
+to JavaScript strings, so string cursors in a web browser should express
+positions in terms of UTF-16 code unit offsets. But we don't want
+WebAssembly strings to be specified in terms of UTF-16 only; non-web
+embeddings will likely represent strings internally using other
+encodings (often UTF-8). So instead we advance cursors in units of
+unicode scalar values, with some allowances for isolated surrogates from
+JavaScript. But we can't define string cursors as being USV offsets,
+because mapping USV offset to code unit offset is O(n). Cursors allow
+us to avoid quadratic algorithms.
+
+The question then becomes: given that cursor values relate to a host's
+string representation, should we hide the details of what a string
+cursor is from users, in the name of abstraction and common, defined
+behavior?
+
+All things being equal, it would have been nice to define string cursors
+in such a way that a program running on a UTF-8 host would behave
+exactly the same as for a UTF-16 host. We could have provided this
+property by making string cursors opaque. This could have gone two
+ways: if we made cursors a first-class reference-typed value, cursors
+could hold a reference to their strings directly. There would then be
+no need for cursor validity checks. On the other hand, we would then
+have a new type that would infiltrate everything, from implementation to
+JavaScript API to the type system and so on. And, absent compiler
+heroics, reference-typed cursors may cause high allocation overheads.
+
+The other way to make cursors opaque would be as scalar
+values. The idea is that a cursor is really an `i32` under the hood,
+but its value isn't accessible. Such a cursor wouldn't stand alone in
+the way reference-typed cursors would: you need to pass a string and a
+cursor to instructions, and you need to check the cursor for validity
+with regards to the string. We still have some of the type profusion
+issues from a "cognitive load" point of view. But, you couldn't observe
+the difference in cursor values between implementations, which would be
+a nice property.
+
+In the end though, besides simplicity, what tipped the balance towards
+plain `i32` values was precisely that string cursors could be meaningful
+to the host instead of opaque. A host should be able to reason about
+string positions and communicate those positions to WebAssembly -- after
+all, the strings belong to the host too. Specifying that string cursors
+are code unit offsets makes this possible, while also constraining
+e.g. WebAssembly implementations in different web browsers to all use
+the same notion of string offsets.
+
+See https://github.com/wingo/wasm-strings/issues/6 and
+https://github.com/wingo/wasm-strings/issues/11 for a full discussion.
### Are stringrefs nullable?
@@ -772,10 +833,3 @@ That said though, any copy is likely to remain in cache, amortizing the
cost of the second access. Inlining the (likely) UTF-8 accesses on the
WebAssembly side seems more important than preventing a copy by using a
codepoint-by-codepoint non-copying interface.
-
-### What are the size limits on a string?
-
-This is probably something for the
-[embedding](https://webassembly.github.io/spec/js-api/index.html#limits)
-to specify. I would guess that codepoint lengths in [0,2^32-1] should
-be the limit from the POV of the core spec.