Continuing #26 (comment):
This is just an idea that came to my mind, I'm not sure it's feasible. That said, on a high level it would turn this proposal into a strict subset of its former self, and the removed parts could always be added back (post-MVP) if desired.
In short, the idea would be to only have two kinds of encoding-specific operations: creating strings from linear memory (and WasmGC arrays), and writing them back to such. On strings themselves, only encoding-independent operations (concatenation, equality checking) would remain.
For source languages that need indexed operations, the source-to-Wasm compiler would have to emit conversions. There are at least three strategies for how this could be done:
- always two copies: a source string could be represented as a pair of an (i8/i16) array and a stringref. Depending on the operation, one or the other is used.
- lazy copy: same as above, but one entry of the pair is created lazily by the first operation that needs it.
- temporary conversion: array copies are created on demand and discarded afterwards. Assuming that indexed operations typically happen in loops, the algorithmic complexity (though not necessarily the constant-factor overhead) of such loops would remain the same. For example, a UTF-8 source language's `for (int i = 0; i < string.length; i++) foo(string.usvAt(i))` would compile to something like:

```
bytes = (string.measure_wtf8 string)
temp = (array.new ($i8-array-type) bytes 0 /* default */)
(string.encode_wtf8 string temp)
for (i = 0; i < bytes; i++) {
  byte = (array.get temp i)
  if (byte > 0x7F) { /* decode more utf-8 bytes */ }
}
```
So the whole task of dealing with UTF-8 falls to the module itself. Since such a source language already represents its strings as UTF-8 in memory, it must already contain the logic required for that task.
Benefits:
- gives maximum flexibility to source languages
- gives maximum opportunity to Wasm engines to choose their preferred internal encoding for stringrefs; as long as modules consistently use one encoding, the amount of re-encoding (though not copying; see below) is minimal: engines can go as far as supporting several internal encodings, turning a matching `string.new_wtf8`/`string.encode_wtf8` pair into a pair of `memcpy` calls
- in case of mixed-encoding usage patterns (not sure how frequent those are?), e.g. creating a string with `string.new_wtf8` and then using `string.get_wtf16` on it a lot: avoids Wasm engine implementation complexity and the correspondingly unpredictable performance by instead letting the module control (and minimize) the conversions that have to happen. To clarify, when I say "engine complexity" I'm not "selfishly" talking about V8 here, as I believe this could be built on top of our existing string system with relative ease and relatively good performance; I'm more worried about Wasm engines that use utf8/wtf8 strings internally, and about trying to run an e.g. Java-produced module on them.
Drawbacks:
- limits the feature set offered by Wasm itself, requires source language compilers to do more work
- the first point might well imply module binary size increases
- increases the amount of copying of character data that needs to be done, which might make the resulting overall performance unacceptable
- support for string slices would have to be dropped, as they aren't expressible without start/length information.
One potential strategy to resolve this question could be to build a prototype for the reduced feature set first, and gather some experimental data. The benefit of offering built-in indexed operations could then be measured as a delta to that. Since the feature set is a subset, this wouldn't create any extra work for the engine prototype; however it would create extra work for the module producer prototype.