Continuing #26 (comment):
This is just an idea that came to my mind, I'm not sure it's feasible. That said, on a high level it would turn this proposal into a strict subset of its former self, and the removed parts could always be added back (post-MVP) if desired.
In short, the idea would be to only have two kinds of encoding-specific operations: creating strings from linear memory (and WasmGC arrays), and writing them back to such. On strings themselves, only encoding-independent operations (concatenation, equality checking) would remain.
For source languages that need indexed operations, the source-to-Wasm compiler would have to emit conversions. There are at least three strategies for how this could be done:
- always two copies: a source string could be represented as a pair of an (i8/i16) array and a stringref. Depending on the operation, one or the other is used.
- lazy copy: same as above, but one entry of the pair is created lazily by the first operation that needs it.
- temporary conversion: array copies are created on demand and discarded afterwards. Assuming that indexed operations typically happen in loops, the algorithmic complexity (though not necessarily the constant-factor overhead) of such loops would remain the same. For example, a UTF-8 source language's `for (int i = 0; i < string.length; i++) foo(string.usvAt(i))` would compile to something like:

```
bytes = (string.measure_wtf8 string)
temp = (array.new ($i8-array-type) bytes 0 /* default */)
(string.encode_wtf8 string temp)
for (i = 0; i < bytes; i++) {
  byte = (array.get temp i)
  if (byte > 0x7F) { /* decode more utf-8 bytes */ }
}
```
So the whole task of dealing with UTF-8 falls to the module itself. Since such a source language already represents its strings as UTF-8 in memory, it must already contain the logic required for that task.
Benefits:
- gives maximum flexibility to source languages
- gives maximum opportunity to Wasm engines to choose their preferred internal encoding for stringrefs; as long as modules consistently use one encoding, the amount of re-encoding (though not copying; see below) is minimal: engines can go as far as supporting several internal encodings, turning a matching `string.new_wtf8`/`string.encode_wtf8` pair into a pair of `memcpy` calls
- in case of mixed-encoding usage patterns (not sure how frequent those are?), e.g. creating a string with `string.new_wtf8` and then using `string.get_wtf16` on it a lot: avoids Wasm engine implementation complexity and the correspondingly unpredictable performance by instead letting the module control (and minimize) the conversions that have to happen. To clarify, when I say "engine complexity" I'm not "selfishly" talking about V8 here, as I believe this could be built on top of our existing string system with relative ease and relatively good performance; I'm more worried about Wasm engines that use utf8/wtf8 strings internally, and about trying to run an e.g. Java-produced module on them.
Drawbacks:
- limits the feature set offered by Wasm itself, requires source language compilers to do more work
- the first point might well imply module binary size increases
- increases the amount of copying of character data that needs to be done, which might make the resulting overall performance unacceptable
- support for string slices would have to be dropped, as they aren't expressible without start/length information.
One potential strategy to resolve this question could be to build a prototype for the reduced feature set first, and gather some experimental data. The benefit of offering built-in indexed operations could then be measured as a delta to that. Since the feature set is a subset, this wouldn't create any extra work for the engine prototype; however it would create extra work for the module producer prototype.