Can we avoid string cursors?

It would be nice if we could avoid introducing the `stringcursor` type, because it leaks everywhere: you have to define JS mappings, represent it in the ABI (I guess you pass it as a single ref object to function calls etc), make it nullable, figure out subtyping, etc.  Terrible.  We can do this if needed, but do we need to?

To answer this, let's back up a little: what is a string cursor anyway?  Logically the cursor must hold two pieces of information: a reference to the string it is iterating over, and some information about the current position.  Probably in most implementations that position information would have two parts: one to indicate the current answer of `string.pos`, and one "internal offset".  The internal offset might be a code unit offset in JavaScript strings, or might be a byte offset in an implementation that uses UTF-8 strings.  Call it the "pos" and the "offset".  The offset is what gives cursors their O(1) access characteristic.
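To make the "pos" and "offset" split concrete, here is a minimal sketch of a cursor over a string stored as UTF-8.  The `Cursor` class and the `start` and `step` helpers are hypothetical names for illustration only, not part of any proposal, and the UTF-8 representation is just one of the possibilities mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cursor:
    data: bytes   # the string, assumed here to be stored as valid UTF-8
    pos: int      # code point index: what `string.pos` would answer
    offset: int   # internal byte offset into `data`


def start(s: str) -> Cursor:
    """Cursor at the beginning: pos and offset are both 0."""
    return Cursor(s.encode("utf-8"), 0, 0)


def step(c: Cursor) -> Cursor:
    """Advance by one code point in O(1), thanks to the stored offset."""
    if c.offset >= len(c.data):
        return c  # already at the end of the string
    lead = c.data[c.offset]
    # Width of a UTF-8 sequence, read off its lead byte (valid input assumed).
    width = 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return Cursor(c.data, c.pos + 1, c.offset + width)
```

Without the stored offset, finding code point `pos` in a UTF-8 string would mean scanning from the start.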

If we "unbox in the spec", probably we remove `string.start`, defining the initial position and offset to be 0.  We would have:
```
(string.end str:stringref)
  -> pos:i32, offset:i32
```

No need for `string.pos` because we have the position directly.  But the problem is, position and offset are two views of the same thing.  Given one, you can compute the other.  So which one do we specify these instructions as using?  Let's assume it's the offset.  Probably we'd rework the seeking instructions to just take an offset, then, and return a delta for the position, possibly clamped to end or beginning of string.

```
(string.advance str:stringref offset:i32 advance:i32)
  -> advanced:i32, offset:i32
(string.rewind str:stringref offset:i32 rewind:i32)
  -> rewound:i32, offset:i32
```
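As a rough sketch of how these two instructions could behave on an implementation with UTF-8 strings, assuming the offset is a byte offset that is already known to be valid (the function names and the `bytes` argument are illustrative assumptions, not proposed API):

```python
from typing import Tuple


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes (0b10xxxxxx)."""
    return (byte & 0xC0) == 0x80


def advance(data: bytes, offset: int, count: int) -> Tuple[int, int]:
    """Move forward by up to `count` code points, clamped at the end.

    Returns (advanced, offset): the position delta and the new offset.
    """
    advanced = 0
    while advanced < count and offset < len(data):
        offset += 1  # skip the lead byte
        while offset < len(data) and is_continuation(data[offset]):
            offset += 1  # skip its continuation bytes
        advanced += 1
    return advanced, offset


def rewind(data: bytes, offset: int, count: int) -> Tuple[int, int]:
    """Move backward by up to `count` code points, clamped at the start.

    Returns (rewound, offset).
    """
    rewound = 0
    while rewound < count and offset > 0:
        offset -= 1
        while offset > 0 and is_continuation(data[offset]):
            offset -= 1  # back up to the lead byte
        rewound += 1
    return rewound, offset
```

In this model, advancing from offset 0 by a very large count returns the string's length in code points together with the end offset.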

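Actually you could replace `string.end` with `string.advance` from an offset of 0 with a maximal advance count: since the result is clamped to the end of the string, you get back the end position and the end offset.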

Also for `string.measure` and `string.encode` you would just pass the offset, I guess:
```
(string.measure $encoding str:stringref offset:i32 max-codepoints:i32)
  -> bytes:i32, valid:i32
(string.encode $encoding $memory str:stringref offset:i32 ptr:address max-bytes:i32)
  -> advanced:i32, offset:i32, bytes:i32
```

This is workable.  Two problems though:
 1. The run-time can't trust that the offset is valid.  For example on a system with UTF-8 strings, the offset would be a byte offset; the user could just make up a value, so the instructions that take offsets would have to verify that it's within bounds, and at the beginning of a codepoint.
 2. We leak an internal implementation detail.  Different run-times could have different offset values.
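The check described in (1) might look like the following on a UTF-8 implementation.  This is only a sketch: `check_offset` is a hypothetical helper, and raising Python exceptions stands in for whatever the instructions would actually do on a bad offset (presumably trap).

```python
def check_offset(data: bytes, offset: int) -> None:
    """Validate a user-supplied byte offset into a valid UTF-8 string."""
    # In bounds: the one-past-the-end offset is allowed, since it marks
    # the end of the string.
    if not 0 <= offset <= len(data):
        raise IndexError("offset out of bounds")
    # At the beginning of a code point: must not be a continuation byte.
    if offset < len(data) and (data[offset] & 0xC0) == 0x80:
        raise ValueError("offset is in the middle of a code point")
```

Both tests are O(1), so the cost is a couple of comparisons per instruction that takes an offset.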

On the other hand, you could see (2) as a strength in a way; i.e. the browser embedding of WebAssembly could define offset to mean JS code units.
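To illustrate how the same logical position maps to different offsets depending on the representation, here is a small sketch; `offset_after` is a hypothetical helper, and it relies on Python `str` slicing being indexed by code point.

```python
def offset_after(s: str, codepoints: int) -> dict:
    """Internal offset after `codepoints` code points, per representation."""
    prefix = s[:codepoints]  # Python slices strings by code point
    return {
        "utf8_bytes": len(prefix.encode("utf-8")),
        "utf16_code_units": len(prefix.encode("utf-16-le")) // 2,
    }
```

For the string "😀a", the offset of "a" (position 1) is 4 in a UTF-8 run-time but 2 in one that counts JS code units.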

I guess in summary: it's possible to expose an "internal offset" as a cursor, as an i32.  This limits flexibility of implementation by effectively specifying the representation of a cursor.  But it removes the burden of having string cursors as a concept and opens up the possibility for defining what an internal offset is.

The downside is mainly the additional checks on offset validity.  However these can be removed if the compiler proves that the offset is 0, or that it comes from a previous call to a cursor-advancing instruction on the same string.  Also, given that the design is meant to process string contents in chunks, perhaps the checks are not frequent enough to be a burden.
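A sketch of the chunked pattern, to show that the validity check runs once per chunk rather than once per code point.  Everything here is an assumption for illustration: `encode_chunk` loosely mimics `string.encode` for a UTF-8 source and UTF-8 target, writing into a `bytearray` instead of linear memory, and `copy_all` is a hypothetical caller.

```python
from typing import Tuple


def encode_chunk(data: bytes, offset: int, out: bytearray,
                 max_bytes: int) -> Tuple[int, int, int]:
    """Copy whole code points from `data[offset:]`, at most `max_bytes`.

    Returns (advanced, offset, bytes), like the `string.encode` sketch.
    """
    # The offset validity check: done once per call, not per code point.
    if not 0 <= offset <= len(data):
        raise IndexError("offset out of bounds")
    if offset < len(data) and (data[offset] & 0xC0) == 0x80:
        raise ValueError("offset is in the middle of a code point")
    end = min(len(data), offset + max_bytes)
    # Back up so the chunk does not end in the middle of a code point.
    while offset < end < len(data) and (data[end] & 0xC0) == 0x80:
        end -= 1
    chunk = data[offset:end]
    out += chunk
    # Every non-continuation byte starts one code point.
    advanced = sum(1 for b in chunk if (b & 0xC0) != 0x80)
    return advanced, end, len(chunk)


def copy_all(s: str, chunk_size: int = 8) -> Tuple[bytes, int]:
    """Encode `s` chunk by chunk; also report how many calls were made."""
    data = s.encode("utf-8")
    out = bytearray()
    offset = 0
    calls = 0
    while offset < len(data):
        _, offset, written = encode_chunk(data, offset, out, chunk_size)
        calls += 1
        if written == 0:
            break  # chunk_size too small to hold the next code point
    return bytes(out), calls
```

With a realistic chunk size, the number of checks is roughly the string's byte length divided by the chunk size.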
