Possible simplification: drop all indexed/position-based access for MVP #31

@jakobkummerow

Continuing #26 (comment):

This is just an idea that came to mind; I'm not sure it's feasible. That said, on a high level it would turn this proposal into a strict subset of its former self, and the removed parts could always be added back (post-MVP) if desired.

In short, the idea would be to have only two kinds of encoding-specific operations: creating strings from linear memory (and WasmGC arrays), and writing them back to linear memory (or arrays). On strings themselves, only encoding-independent operations (concatenation, equality checking) would remain.

For source languages that need indexed operations, the source-to-Wasm compiler would have to emit conversions. There are at least three strategies for doing this:

  • always two copies: a source string could be represented as a pair of an (i8/i16) array and a stringref. Depending on the operation, one or the other is used.
  • lazy copy: same as above, but one entry of the pair is created lazily by the first operation that needs it.
  • temporary conversion: array copies are created on demand and discarded afterwards. Assuming that indexed operations typically happen in loops, the algorithmic complexity (though not necessarily the constant-factor overhead) of such loops would remain the same. For example, a UTF-8 source language's for (int i = 0; i < string.length; i++) foo(string.usvAt(i)) would compile to something like:
    bytes = (string.measure_wtf8 string)
    temp = (array.new ($i8-array-type) bytes 0 /* default */)
    (string.encode_wtf8 string temp)
    for (i = 0; i < bytes; i++) {
      byte = (array.get temp i)
      if (byte > 0x7F) { /* decode more utf-8 bytes */ }
    }
    
    So the whole task of dealing with utf-8 falls to the module itself. Considering that this source language is used to representing UTF-8 strings in memory, it must already have the required logic for that task.
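To make the "lazy copy" strategy a bit more concrete, here is a minimal sketch in Python of what a module-side string representation could look like. Everything in it is an assumption made for illustration: the class `SourceString`, its methods, and the `copies_made` counter are hypothetical names; a Python `str` stands in for an opaque stringref and a `bytes` object for the (i8) array. It also uses strict UTF-8 rather than WTF-8, so lone surrogates are not handled.

```python
from typing import Optional


class SourceString:
    """Hypothetical source-language string: a stringref plus a lazily created byte array."""

    def __init__(self, stringref: str):
        self.stringref = stringref           # stand-in for an opaque Wasm stringref
        self._utf8: Optional[bytes] = None   # stand-in for the (i8) array copy
        self.copies_made = 0                 # counts conversions, for illustration only

    def utf8(self) -> bytes:
        # Lazy copy: encode once, on the first operation that needs indexed access.
        if self._utf8 is None:
            self._utf8 = self.stringref.encode("utf-8")
            self.copies_made += 1
        return self._utf8

    def concat(self, other: "SourceString") -> "SourceString":
        # Encoding-independent operation: only the stringref side of the pair is used.
        return SourceString(self.stringref + other.stringref)

    def usv_at(self, i: int) -> int:
        """Decode the scalar value whose UTF-8 sequence starts at byte offset i."""
        buf = self.utf8()
        b0 = buf[i]
        if b0 < 0x80:
            return b0                        # single-byte (ASCII) sequence
        if b0 < 0xC0:
            raise ValueError("offset is not at the start of a UTF-8 sequence")
        if b0 < 0xE0:
            extra, cp = 1, b0 & 0x1F         # two-byte sequence
        elif b0 < 0xF0:
            extra, cp = 2, b0 & 0x0F         # three-byte sequence
        else:
            extra, cp = 3, b0 & 0x07         # four-byte sequence
        for b in buf[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)      # append 6 payload bits per continuation byte
        return cp
```

In this sketch, repeated `usv_at` calls in a loop pay for the conversion only once, while strings that are only concatenated or compared never create the byte array at all.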

Benefits:

  • gives maximum flexibility to source languages
  • gives Wasm engines maximum freedom to choose their preferred internal encoding for stringrefs. As long as modules consistently use one encoding, the amount of re-encoding (though not copying; see below) is minimal: engines can go as far as supporting several internal encodings, turning a matching string.new_wtf8/string.encode_wtf8 pair into a pair of memcpy calls
  • in case of mixed-encoding usage patterns (not sure how frequent those are?), e.g. creating a string with string.new_wtf8 and then using string.get_wtf16 on it a lot, avoids Wasm engine implementation complexity and the correspondingly highly unpredictable performance by instead letting the module control (and minimize) the conversions that have to happen. To clarify, when I say "engine complexity" I'm not "selfishly" talking about V8 here, as I believe this could be built on top of our existing string system with relative ease and relatively good performance; I'm more worried about Wasm engines that use utf8/wtf8 strings internally trying to run, e.g., a Java-produced module.
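As a rough illustration of the "several internal encodings" point, the following toy model in Python shows an engine-side string that remembers the encoding it was created with. It is only a sketch under stated assumptions: `ToyStringRef`, `new_wtf8`, `new_wtf16`, and `encode` are hypothetical stand-ins, not the proposal's instructions or any engine's actual implementation, and Python's strict "utf-8"/"utf-16-le" codecs are used in place of WTF-8/WTF-16.

```python
from typing import Tuple


class ToyStringRef:
    """Toy engine string that keeps its bytes in the encoding it was created with."""

    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding  # "utf-8" or "utf-16-le" in this toy model


def new_wtf8(memory: bytes) -> ToyStringRef:
    # Creation keeps the bytes as given; this models a memcpy with no transcoding.
    return ToyStringRef(bytes(memory), "utf-8")


def new_wtf16(memory: bytes) -> ToyStringRef:
    return ToyStringRef(bytes(memory), "utf-16-le")


def encode(s: ToyStringRef, target: str) -> Tuple[bytes, bool]:
    """Return (encoded bytes, whether re-encoding was needed)."""
    if s.encoding == target:
        # Matching new/encode pair: the bytes are handed back unchanged.
        return bytes(s.data), False
    # Mixed-encoding use: the engine has to transcode.
    return s.data.decode(s.encoding).encode(target), True
```

A module that consistently uses one encoding only ever hits the first branch of `encode`; the transcoding branch is the cost that this proposal would move out of the engine and into the module.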

Drawbacks:

  • limits the feature set offered by Wasm itself, requires source language compilers to do more work
  • the first point may well increase module binary size
  • increases the amount of copying of character data that needs to be done, which might make the resulting overall performance unacceptable
  • support for string slices would have to be dropped, as they aren't expressible without start/length information.

One potential strategy to resolve this question could be to build a prototype for the reduced feature set first, and gather some experimental data. The benefit of offering built-in indexed operations could then be measured as a delta to that. Since the feature set is a subset, this wouldn't create any extra work for the engine prototype; however, it would create extra work for the module producer prototype.
