Can we avoid string cursors? #6

@wingo

It would be nice if we could avoid introducing the stringcursor type, because it leaks everywhere: you have to define JS mappings, represent it in the ABI (I guess you pass it as a single ref object to function calls etc), make it nullable, figure out subtyping, etc. Terrible. We can do this if needed, but do we need to?

To answer this, let's back up a little: what is a string cursor anyway? Logically the cursor must hold two pieces of information: a reference to the string it is iterating over, and some information about the current position. Probably in most implementations that position information would have two parts: one holding the value that string.pos would currently return, and one "internal offset". The internal offset might be a code unit offset in JavaScript strings, or a byte offset in an implementation that uses UTF-8 strings. Call them the "pos" and the "offset". The offset is what gives cursors their O(1) access characteristic.
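To make the pos/offset split concrete, here is a minimal sketch of what a boxed cursor might hold in a run-time that stores strings as UTF-8. The `Utf8Cursor` name, its fields, and the `advance` method are made up for illustration; nothing here is from the proposal.

```python
from dataclasses import dataclass


@dataclass
class Utf8Cursor:
    """Hypothetical boxed cursor over a UTF-8 string (illustration only)."""
    data: bytes      # the string being iterated, assumed to be valid UTF-8
    pos: int = 0     # codepoints from the start: what string.pos would report
    offset: int = 0  # byte offset into `data`: the "internal offset"

    def advance(self) -> bool:
        """Step over one codepoint; return False if already at the end."""
        if self.offset >= len(self.data):
            return False
        self.offset += 1
        # Skip UTF-8 continuation bytes, which have the form 0b10xxxxxx.
        while (self.offset < len(self.data)
               and (self.data[self.offset] & 0xC0) == 0x80):
            self.offset += 1
        self.pos += 1
        return True


cursor = Utf8Cursor("héllo".encode("utf-8"))
cursor.advance()  # over "h" (1 byte)
cursor.advance()  # over "é" (2 bytes)
print(cursor.pos, cursor.offset)
```

After the two steps the pos is 2 but the offset is 3: two views of the same place in the string, with only the offset usable for O(1) access into the bytes.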

If we "unbox in the spec", probably we remove string.start, defining the initial position and offset to be 0. We would have:

(string.end str:stringref)
  -> pos:i32, offset:i32

No need for string.pos because we have the position directly. But the problem is, position and offset are two views of the same thing. Given one, you can compute the other. So which one do we specify these instructions as using? Let's assume it's the offset. Probably we'd rework the seeking instructions to just take an offset, then, and return a delta for the position, possibly clamped to end or beginning of string.

(string.advance str:stringref offset:i32 advance:i32)
  -> advanced:i32, offset:i32
(string.rewind str:stringref offset:i32 rewind:i32)
  -> rewound:i32, offset:i32

Actually you could replace string.end with string.advance on an offset of 0.
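As a rough sketch of those two instructions, assuming a run-time with UTF-8 strings where the offset is a byte offset (the function names and the clamping behaviour are my reading of the signatures above, not settled semantics):

```python
from typing import Tuple


def _is_continuation(byte: int) -> bool:
    # UTF-8 continuation bytes have the form 0b10xxxxxx.
    return (byte & 0xC0) == 0x80


def string_advance(data: bytes, offset: int, advance: int) -> Tuple[int, int]:
    """Move forward by up to `advance` codepoints, clamped to end of string.

    Assumes `offset` is already valid. Returns (advanced, new_offset).
    """
    advanced = 0
    while advanced < advance and offset < len(data):
        offset += 1
        while offset < len(data) and _is_continuation(data[offset]):
            offset += 1
        advanced += 1
    return advanced, offset


def string_rewind(data: bytes, offset: int, rewind: int) -> Tuple[int, int]:
    """Move backward by up to `rewind` codepoints, clamped to the beginning.

    Assumes `offset` is already valid. Returns (rewound, new_offset).
    """
    rewound = 0
    while rewound < rewind and offset > 0:
        offset -= 1
        while offset > 0 and _is_continuation(data[offset]):
            offset -= 1
        rewound += 1
    return rewound, offset


data = "a€b".encode("utf-8")  # 1 + 3 + 1 bytes
print(string_advance(data, 0, 2))          # two codepoints forward
print(string_advance(data, 0, 2**31 - 1))  # plays the role of string.end
print(string_rewind(data, 4, 1))           # back over the 3-byte codepoint
```

Advancing from offset 0 by the largest i32 returns the codepoint length and the end offset together, which is why a separate string.end looks redundant.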

Also for string.measure and string.encode you would just pass the offset, I guess:

(string.measure $encoding str:stringref offset:i32 max-codepoints:i32)
  -> bytes:i32, valid:i32
(string.encode $encoding $memory str:stringref offset:i32 ptr:address max-bytes:i32)
  -> advanced:i32, offset:i32, bytes:i32
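A possible shape for string.encode under the same assumptions (UTF-8 strings internally, UTF-8 as the target `$encoding`, linear memory modelled as a `bytearray`). The function name and the "stop at the last whole codepoint that fits" rule are assumptions for the sake of the sketch:

```python
from typing import Tuple


def string_encode_utf8(data: bytes, offset: int, memory: bytearray,
                       ptr: int, max_bytes: int) -> Tuple[int, int, int]:
    """Copy whole codepoints from `data[offset:]` into `memory` at `ptr`.

    Writes at most `max_bytes` bytes and never splits a codepoint.
    Assumes `offset` is already valid.
    Returns (advanced, new_offset, bytes_written).
    """
    if ptr < 0 or max_bytes < 0 or ptr + max_bytes > len(memory):
        raise IndexError("destination range is out of bounds")
    advanced = 0
    written = 0
    while offset < len(data):
        # Find the end of the codepoint that starts at `offset`.
        end = offset + 1
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end += 1
        size = end - offset
        if written + size > max_bytes:
            break  # the next codepoint does not fit in this chunk
        memory[ptr + written:ptr + written + size] = data[offset:end]
        written += size
        advanced += 1
        offset = end
    return advanced, offset, written


# Encode "a€b" in chunks of at most 3 bytes, threading the offset through.
data = "a€b".encode("utf-8")
memory = bytearray(8)
offset = 0
while True:
    advanced, offset, written = string_encode_utf8(data, offset, memory, 0, 3)
    if advanced == 0:
        break
    print(advanced, offset, bytes(memory[:written]))
```

Note that a caller passing a `max_bytes` smaller than the next codepoint gets `advanced == 0` before the end of the string, so the loop above would stop early; a real design would need to say how that case is reported.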

This is workable. Two problems though:

  1. The run-time can't trust that the offset is valid. For example on a system with UTF-8 strings, the offset would be a byte offset; the user could just make up a value, so the instructions that take offsets would have to verify that it's within bounds and that it falls at the beginning of a codepoint.
  2. We leak an internal implementation detail. Different run-times could have different offset values.

On the other hand, you could see (2) as a strength in a way; i.e. the browser embedding of WebAssembly could define offset to mean JS code units.
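For a sense of the cost of (1), the check on a UTF-8 run-time could be as small as the sketch below. `check_offset` is a made-up name, and treating the end-of-string offset as valid is an assumption.

```python
def check_offset(data: bytes, offset: int) -> bool:
    """Return True if `offset` is in bounds and starts a codepoint.

    `data` is assumed to be valid UTF-8. The end-of-string offset counts
    as valid so that a cursor can rest at the end.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # A codepoint can start at any byte that is not a continuation byte
    # (continuation bytes have the form 0b10xxxxxx).
    return (data[offset] & 0xC0) != 0x80


data = "a€b".encode("utf-8")  # bytes: 61 | e2 82 ac | 62
print([check_offset(data, i) for i in range(-1, 7)])
```

That is one bounds comparison plus one byte load and mask per instruction that takes an offset.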

I guess in summary: it's possible to expose an "internal offset" as a cursor, as an i32. This limits flexibility of implementation by effectively specifying the representation of a cursor. But it removes the burden of having string cursors as a concept and opens up the possibility for defining what an internal offset is.

The downside is mainly the additional checks on offset validity. However these can be removed if the compiler proves that the offset is 0, or that it comes from a previous call to a cursor-advancing instruction. Also, given that the design is to process string contents in chunks, perhaps the checks are not frequent enough to be a burden.
