Clarify use of "codepoints" as string positions #7

@wingo

The proposal specifies that string.advance advances by some number of "codepoints". But a surrogate is a codepoint, so where is the cursor after advancing one codepoint in "\uD800\uDC00"? Is it at the end of the string, or between the two surrogates?

The answer here is subtle and needs clarifying. Seeking by codepoints is not the same as seeking by code units in a UTF-16 string; the intention is that the string-traversing instructions decode surrogate pairs. For this reason, earlier drafts of this proposal instead described positions in terms of Unicode scalar values (USVs). But then what do you do with a JavaScript string that is invalid UTF-16? Do you make string.advance signal an error if the string is not a sequence of USVs?
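To make the code-unit versus codepoint distinction concrete, here is a rough sketch in TypeScript. The helper names (`isHighSurrogate`, `isLowSurrogate`, `advanceCodepoints`) are hypothetical and are not part of the proposal; the sketch only models the intended cursor movement over a JavaScript string, with positions expressed as code-unit indices.

```typescript
// Hypothetical helpers, not part of the proposal.
function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// Advance `count` codepoints from code-unit index `pos`, returning the new
// code-unit index. A high surrogate immediately followed by a low surrogate
// is decoded as one codepoint, so the cursor never stops inside such a pair.
// Assumption of this sketch: any other code unit counts as one step, and
// advancing past the end clamps to the string length.
function advanceCodepoints(s: string, pos: number, count: number): number {
  let i = pos;
  for (let n = 0; n < count && i < s.length; n++) {
    const paired =
      isHighSurrogate(s.charCodeAt(i)) &&
      i + 1 < s.length &&
      isLowSurrogate(s.charCodeAt(i + 1));
    i += paired ? 2 : 1;
  }
  return i;
}

const sample = "a\uD800\uDC00b"; // 4 code units, 3 codepoints
console.log(advanceCodepoints(sample, 0, 2)); // 3: past "a" and the whole pair
console.log(advanceCodepoints("\uD800\uDC00", 0, 1)); // 2: the end of the string
```

Seeking two code units from the start of `sample` would instead land at index 2, in the middle of the surrogate pair, which is exactly the position that seeking by codepoints is meant to rule out.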

For this reason I relaxed the wording to "codepoints", which provides some leniency: an isolated surrogate counts as a single codepoint for the purposes of "position in the string". You only trap if you attempt to encode a substring that isn't a USV sequence, and you can use the measure instruction to detect unencodable strings. But the document should make this clearer.
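As an illustration of the lenient reading, the sketch below decodes a JavaScript string into codepoints, keeping each isolated surrogate as its own codepoint, and then checks whether the result is a USV sequence. Both `decodeCodepoints` and `isUsvSequence` are hypothetical names used only for this example; the check merely stands in for the kind of test that would tell you a string is unencodable, and is not the proposal's actual measure instruction.

```typescript
// Hypothetical helper: decode UTF-16 code units into codepoints.
// A valid high+low pair becomes one supplementary codepoint; any other
// surrogate is kept as a single (surrogate) codepoint rather than rejected.
function decodeCodepoints(s: string): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < s.length) {
    const hi = s.charCodeAt(i);
    const lo = i + 1 < s.length ? s.charCodeAt(i + 1) : -1;
    if (hi >= 0xd800 && hi <= 0xdbff && lo >= 0xdc00 && lo <= 0xdfff) {
      out.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
      i += 2;
    } else {
      out.push(hi);
      i += 1;
    }
  }
  return out;
}

// Hypothetical helper: a string is a USV sequence when no decoded codepoint
// falls in the surrogate range U+D800..U+DFFF.
function isUsvSequence(s: string): boolean {
  return decodeCodepoints(s).every((cp) => cp < 0xd800 || cp > 0xdfff);
}

console.log(decodeCodepoints("\uD800\uDC00").length); // 1: a pair is one codepoint
console.log(decodeCodepoints("\uDC00\uD800").length); // 2: two isolated surrogates
console.log(isUsvSequence("\uD800\uDC00")); // true
console.log(isUsvSequence("\uDC00\uD800")); // false: would trap on encode
```

Under this reading, positions are always well defined, even for invalid UTF-16; only encoding is sensitive to isolated surrogates.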
