Clarify use of "codepoints" as string positions

The proposal specifies that `string.advance` advances some number of "codepoints".  But a surrogate is a codepoint; where is the cursor after advancing one codepoint in `"\uD8000\uDC000"` ? Is it at the end of the string or between the codepoints?

The answer here is subtle and needs clarifying.  Seeking by codepoints is not the same as seeking by code units in a UTF-16 string; the intention is that the string-traversing instructions decode surrogate pairs.  For this reason earlier drafts of this proposal instead spoke of positions in terms of unicode scalar values.  But then what do you do with a JavaScript string that is invalid UTF-16?  Do you make `string.advance` signal an error if the string is not a sequence of USVs?

For this reason I relaxed the wording to "codepoints", providing some degree of leniency.  Isolated surrogates count as a single codepoint, for the purposes of "position in the string".  You only trap if you attempt to encode a substring that isn't a USV sequence, and you can use the `measure` instruction to detect unencodable strings.  But the document should make this more clear.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify use of "codepoints" as string positions #7

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Clarify use of "codepoints" as string positions #7

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions