Add shortcuts for all scripts used by more than 50 million people. #14
updateMatchStateWide. This speeds up the code slightly due to the need to do fewer evaluations, but more importantly it opens the door to using different shortcuts in a wide and narrow context. This is helpful because extended Latin is full of ambiguous characters, making it impossible to write efficient shortcuts for languages using diacritics in Latin script.
Also make the shortcut tests more treelike, so we don't need to go through every test.
|
Benchmarks: master on left, this branch on right. |
|
Something to explore:
|
|
Answer to question #1 above: it uses a lot of characters from Arabic Presentation Forms B. There also exists Arabic Presentation Forms A. The text says that these are only used for compatibility with older codepages, and are not needed for coding text. I interpret this to mean there are old documents around using them, but few new documents will be created. I have no idea whether we should include shortcuts for these blocks. A possible fix: choose a test file which uses the main block and not these presentation forms, and trust that's enough. Thoughts? |
|
If we change the Arabic benchmark to that suggested in #15, we get a more expected |
|
I feel like writing all these shortcuts by hand is something the computer should be able to do better than we can. The basic multilingual plane consists of 16^4 points. But we don't need to know all of them, just the boundary markers where the width changes. If we restrict to a narrow context (i.e. resolve ambiguous characters to Narrow), then there are 431 such boundaries in the BMP. There are 722 in a wide context. This means a binary search should be able to get the width with at most 9 comparisons in a narrow context, and 10 in a wide context. The current shortcuts only do better than that for ASCII in any context, for extended Latin in a Narrow context, and Han ideographs in a wide context. We seem to get much bigger slowdowns than that, and I suspect it has to do with the data types we're using not being easily optimised. Do you have any experience with this sort of optimisation? |
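To make the boundary idea above concrete, here is a minimal sketch of a binary search over a sorted table of width boundaries. Everything in it is an assumption for illustration: the names `boundaries`, `widthOf` and `realLengthSketch` are hypothetical and not part of doclayout, and the five-entry table is a heavily simplified toy, not the real Unicode width data.

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.Char (ord)

-- | Each entry is (first code point of a run, width of that run).
-- Toy table for illustration only; the real tables would have
-- 431 (narrow context) or 722 (wide context) entries.
boundaries :: Array Int (Int, Int)
boundaries = listArray (0, length entries - 1) entries
  where
    entries =
      [ (0x0000, 1)  -- ASCII and Latin: narrow
      , (0x0300, 0)  -- combining diacritical marks: zero width
      , (0x0370, 1)  -- Greek onwards: narrow (simplified)
      , (0x4E00, 2)  -- CJK unified ideographs: wide
      , (0xA4D0, 1)  -- Lisu onwards: narrow (simplified)
      ]

-- | Binary search for the last boundary whose start is <= the code point,
-- then return the width attached to that boundary.
widthOf :: Char -> Int
widthOf c = go lo hi
  where
    (lo, hi) = bounds boundaries
    cp = ord c
    go l h
      | l >= h = snd (boundaries ! l)
      | fst (boundaries ! mid) <= cp = go mid h        -- answer is at mid or later
      | otherwise = go l (mid - 1)                     -- answer is before mid
      where
        mid = (l + h + 1) `div` 2                      -- round up so the range always shrinks

-- | Sum of character widths (hypothetical stand-in for a "real length").
realLengthSketch :: String -> Int
realLengthSketch = sum . map widthOf
```

Each step halves the candidate range, so a table of 431 entries needs at most 9 comparisons and one of 722 entries at most 10, matching the counts above. Whether an array of pairs like this is the right data type for speed is exactly the open question, so treat it as a starting point rather than a recommendation.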
|
Excellent! This is great. If you want to explore these further optimizations, it's always nice to do better, but I think the current performance is perfectly adequate. (You can see the "real world" effects by looking at the tabular benchmarks for Greek and English; although in
Which data types are you referring to specifically? I don't see anything in the data structures used by Generating the boundaries by computer seems smart, if you feel like working this out. And if we can get rid of special emoji handling, that will simplify the code which is definitely good. Another thought: it could be worth considering whether the "real length" calculation is generally useful enough to abstract it out into a separate library, which doclayout could depend on. (But only if you feel like creating and maintaining such a library, of course!) |
|
I think the ‘real length’ calculation is definitely useful enough on its own. The entire reason I'm here is that we wanted to use it in hledger to improve our tabular layouts, and the best way of doing that seemed to be to pull doclayout in as a dependency in table-layout so it could do this properly. Emoji handling is because some people apparently like putting cheese in their financial reports. It's also true that this would do well as a separate library. It would serve as a replacement for the unmaintained wcwidth library, which doesn't work in Windows for some reason. That said, I have no experience creating/maintaining libraries. Maybe this is a good opportunity to learn. |
|
If you want to create a separate library, you could use my emojis package as a model, since that's a very similar library in many ways. |
|
If I were to break it off in a separate library would you prefer that I do that soon to avoid cluttering your commit history, or would you prefer that I get the design improved in here first, before breaking it off when it's less likely to change (and require a separate release)? |
|
It's okay with me to develop it here for now. |
Plus a few less common ones that we can get for free (e.g. Greek and Armenian).