Add shortcuts for all scripts used by more than 50 million people. by Xitian9 · Pull Request #14 · jgm/doclayout

Xitian9 · 2021-10-23T10:44:32Z

Plus a few less common ones that we can get for free (e.g. Greek and Armenian).

updateMatchStateWide. This speeds up the code slightly due to the need to do fewer evaluations, but more importantly it opens the door to using different shortcuts in a wide and narrow context. This is helpful because extended Latin is full of ambiguous characters, making it impossible to write efficient shortcuts for languages using diacritics in Latin script.

Also make the shortcut tests more treelike, so we don't need to go through every test.
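
To illustrate what "treelike" shortcut tests with separate narrow and wide variants might look like, here is a minimal sketch. It is not the code from this PR: the names `shortcutNarrow` and `shortcutWide`, the split point, and the handful of ranges are assumptions chosen for illustration. A result of `Nothing` means no shortcut applied and the full lookup would be needed.

```haskell
-- Hypothetical shortcut for a narrow context. One comparison at the top picks
-- a subtree, so a character is only tested against the ranges on its own side.
shortcutNarrow :: Char -> Maybe Int
shortcutNarrow c
  | c < '\x3000' = lowHalf
  | otherwise = highHalf c
  where
    lowHalf
      | c >= ' ' && c <= '~' = Just 1            -- printable ASCII
      | c >= '\x0410' && c <= '\x044F' = Just 1  -- Cyrillic letters: ambiguous, narrow here
      | otherwise = Nothing

-- Hypothetical shortcut for a wide context. Only the ambiguous range differs.
shortcutWide :: Char -> Maybe Int
shortcutWide c
  | c < '\x3000' = lowHalf
  | otherwise = highHalf c
  where
    lowHalf
      | c >= ' ' && c <= '~' = Just 1            -- printable ASCII
      | c >= '\x0410' && c <= '\x044F' = Just 2  -- Cyrillic letters: ambiguous, wide here
      | otherwise = Nothing

-- Ranges that are wide in either context.
highHalf :: Char -> Maybe Int
highHalf c
  | c >= '\x4E00' && c <= '\x9FFF' = Just 2  -- CJK Unified Ideographs
  | c >= '\xAC00' && c <= '\xD7A3' = Just 2  -- Hangul syllables
  | otherwise = Nothing
```

Keeping the two contexts in separate functions means an ambiguous range can be answered with a constant in each one, instead of being excluded from the shortcuts altogether.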

Xitian9 · 2021-10-23T10:52:23Z

Benchmarks: master on left, this branch on right.

benchmarking sample document 2					benchmarking sample document 2
time                 11.02 μs   (11.01 μs .. 11.03 μs)	      |	time                 11.00 μs   (10.99 μs .. 11.02 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)		                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 11.03 μs   (11.02 μs .. 11.04 μs)	      |	mean                 11.00 μs   (10.98 μs .. 11.02 μs)
std dev              30.36 ns   (24.23 ns .. 41.13 ns)	      |	std dev              57.13 ns   (43.34 ns .. 77.01 ns)

benchmarking reflow English					benchmarking reflow English
time                 118.3 μs   (118.2 μs .. 118.4 μs)	      |	time                 123.9 μs   (123.4 μs .. 124.4 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)		                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 118.2 μs   (118.1 μs .. 118.3 μs)	      |	mean                 123.1 μs   (122.9 μs .. 123.5 μs)
std dev              352.7 ns   (283.6 ns .. 466.2 ns)	      |	std dev              1.047 μs   (773.4 ns .. 1.484 μs)

benchmarking reflow Greek					benchmarking reflow Greek
time                 103.7 μs   (103.6 μs .. 103.7 μs)	      |	time                 108.1 μs   (107.7 μs .. 108.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)		                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 103.7 μs   (103.7 μs .. 103.8 μs)	      |	mean                 107.6 μs   (107.4 μs .. 107.7 μs)
std dev              320.2 ns   (247.6 ns .. 455.5 ns)	      |	std dev              626.1 ns   (464.4 ns .. 797.8 ns)

benchmarking tabular English					benchmarking tabular English
time                 1.648 ms   (1.644 ms .. 1.652 ms)	      |	time                 1.610 ms   (1.583 ms .. 1.632 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 1.680 ms   (1.669 ms .. 1.693 ms)	      |	mean                 1.633 ms   (1.615 ms .. 1.686 ms)
std dev              47.04 μs   (39.34 μs .. 61.23 μs)	      |	std dev              112.1 μs   (48.88 μs .. 219.9 μs)
variance introduced by outliers: 18% (moderately inflated)    |	variance introduced by outliers: 57% (severely inflated)

benchmarking tabular Greek					benchmarking tabular Greek
time                 3.344 ms   (3.336 ms .. 3.351 ms)	      |	time                 1.857 ms   (1.837 ms .. 1.877 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 3.345 ms   (3.340 ms .. 3.352 ms)	      |	mean                 1.847 ms   (1.832 ms .. 1.875 ms)
std dev              22.80 μs   (17.34 μs .. 33.63 μs)	      |	std dev              77.17 μs   (41.45 μs .. 119.1 μs)
							      >	variance introduced by outliers: 32% (moderately inflated)

benchmarking soft spaces at end of line				benchmarking soft spaces at end of line
time                 5.053 μs   (5.049 μs .. 5.057 μs)	      |	time                 5.420 μs   (5.339 μs .. 5.557 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 5.052 μs   (5.049 μs .. 5.054 μs)	      |	mean                 5.378 μs   (5.356 μs .. 5.471 μs)
std dev              10.13 ns   (7.876 ns .. 13.46 ns)	      |	std dev              136.5 ns   (48.68 ns .. 322.1 ns)
							      >	variance introduced by outliers: 31% (moderately inflated)

benchmarking UDHR English					benchmarking UDHR English
time                 24.50 ms   (24.44 ms .. 24.58 ms)	      |	time                 25.51 ms   (24.76 ms .. 26.14 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.996 R²   (0.993 R² .. 0.999 R²)
mean                 24.59 ms   (24.55 ms .. 24.65 ms)	      |	mean                 24.97 ms   (24.63 ms .. 25.70 ms)
std dev              139.9 μs   (91.04 μs .. 222.8 μs)	      |	std dev              1.213 ms   (698.4 μs .. 2.150 ms)
							      >	variance introduced by outliers: 21% (moderately inflated)

benchmarking UDHR French					benchmarking UDHR French
time                 40.11 ms   (40.03 ms .. 40.19 ms)	      |	time                 26.23 ms   (25.99 ms .. 26.56 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 40.20 ms   (40.12 ms .. 40.30 ms)	      |	mean                 26.48 ms   (26.34 ms .. 26.69 ms)
std dev              220.5 μs   (161.0 μs .. 308.0 μs)	      |	std dev              448.8 μs   (301.8 μs .. 617.3 μs)

benchmarking UDHR Vietnamese					benchmarking UDHR Vietnamese
time                 74.40 ms   (74.24 ms .. 74.53 ms)	      |	time                 26.59 ms   (26.32 ms .. 26.88 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 74.37 ms   (74.26 ms .. 74.52 ms)	      |	mean                 26.41 ms   (26.32 ms .. 26.55 ms)
std dev              267.2 μs   (171.2 μs .. 391.0 μs)	      |	std dev              295.4 μs   (233.8 μs .. 413.9 μs)

benchmarking UDHR Mandarin					benchmarking UDHR Mandarin
time                 35.09 ms   (35.02 ms .. 35.15 ms)	      |	time                 31.52 ms   (31.12 ms .. 31.99 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.998 R²   (0.995 R² .. 0.999 R²)
mean                 35.05 ms   (35.00 ms .. 35.10 ms)	      |	mean                 31.48 ms   (31.23 ms .. 31.95 ms)
std dev              120.1 μs   (87.38 μs .. 170.8 μs)	      |	std dev              825.2 μs   (551.7 μs .. 1.312 ms)

benchmarking UDHR Arabic					benchmarking UDHR Arabic
time                 260.9 ms   (260.6 ms .. 261.7 ms)	      |	time                 186.4 ms   (178.9 ms .. 189.6 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.998 R²   (0.993 R² .. 1.000 R²)
mean                 261.0 ms   (260.6 ms .. 261.2 ms)	      |	mean                 195.0 ms   (191.6 ms .. 200.4 ms)
std dev              429.5 μs   (268.1 μs .. 611.4 μs)	      |	std dev              6.837 ms   (4.686 ms .. 9.765 ms)
variance introduced by outliers: 11% (moderately inflated)    <

benchmarking UDHR Hindi						benchmarking UDHR Hindi
time                 290.3 ms   (289.1 ms .. 291.2 ms)	      |	time                 36.75 ms   (36.35 ms .. 37.17 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 290.4 ms   (290.2 ms .. 290.8 ms)	      |	mean                 36.96 ms   (36.75 ms .. 37.20 ms)
std dev              417.3 μs   (165.4 μs .. 670.0 μs)	      |	std dev              540.3 μs   (438.4 μs .. 692.6 μs)
variance introduced by outliers: 12% (moderately inflated)    <

benchmarking UDHR Bengali					benchmarking UDHR Bengali
time                 301.0 ms   (300.5 ms .. 301.5 ms)	      |	time                 38.21 ms   (38.12 ms .. 38.33 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)		                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 301.2 ms   (300.9 ms .. 301.5 ms)	      |	mean                 38.26 ms   (38.18 ms .. 38.38 ms)
std dev              466.5 μs   (309.0 μs .. 604.1 μs)	      |	std dev              214.1 μs   (128.7 μs .. 354.2 μs)
variance introduced by outliers: 12% (moderately inflated)    <

benchmarking UDHR Russian					benchmarking UDHR Russian
time                 260.3 ms   (259.6 ms .. 261.0 ms)	      |	time                 38.80 ms   (38.63 ms .. 39.02 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 259.6 ms   (259.4 ms .. 259.9 ms)	      |	mean                 38.74 ms   (38.59 ms .. 38.96 ms)
std dev              366.0 μs   (164.4 μs .. 582.1 μs)	      |	std dev              438.6 μs   (282.1 μs .. 716.1 μs)
variance introduced by outliers: 11% (moderately inflated)    <

benchmarking UDHR Japanese					benchmarking UDHR Japanese
time                 48.43 ms   (48.34 ms .. 48.52 ms)	      |	time                 31.65 ms   (31.34 ms .. 31.83 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 48.31 ms   (48.22 ms .. 48.37 ms)	      |	mean                 32.03 ms   (31.84 ms .. 32.37 ms)
std dev              164.3 μs   (120.6 μs .. 239.8 μs)	      |	std dev              623.6 μs   (271.6 μs .. 912.7 μs)

benchmarking UDHR Korean					benchmarking UDHR Korean
time                 202.2 ms   (201.9 ms .. 202.7 ms)	      |	time                 31.12 ms   (30.83 ms .. 31.62 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 203.0 ms   (202.6 ms .. 204.3 ms)	      |	mean                 30.98 ms   (30.81 ms .. 31.19 ms)
std dev              1.036 ms   (159.4 μs .. 1.699 ms)	      |	std dev              504.4 μs   (380.3 μs .. 717.5 μs)

benchmarking UDHR Telugu					benchmarking UDHR Telugu
time                 313.2 ms   (309.0 ms .. 316.1 ms)	      |	time                 44.63 ms   (44.46 ms .. 44.81 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)	      |	                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 315.6 ms   (314.2 ms .. 318.6 ms)	      |	mean                 44.58 ms   (44.45 ms .. 44.72 ms)
std dev              2.810 ms   (1.641 ms .. 4.068 ms)	      |	std dev              310.5 μs   (247.8 μs .. 463.2 μs)
variance introduced by outliers: 12% (moderately inflated)    <

benchmarking UDHR Tamil						benchmarking UDHR Tamil
time                 307.1 ms   (306.8 ms .. 307.4 ms)	      |	time                 49.27 ms   (48.61 ms .. 49.89 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 306.6 ms   (306.4 ms .. 306.8 ms)	      |	mean                 49.89 ms   (49.55 ms .. 50.21 ms)
std dev              316.3 μs   (223.2 μs .. 416.8 μs)	      |	std dev              729.7 μs   (626.3 μs .. 862.5 μs)
variance introduced by outliers: 12% (moderately inflated)    <

benchmarking UDHR Thai						benchmarking UDHR Thai
time                 341.2 ms   (340.5 ms .. 341.9 ms)	      |	time                 323.1 ms   (318.6 ms .. 326.3 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 341.0 ms   (340.8 ms .. 341.3 ms)	      |	mean                 321.8 ms   (320.4 ms .. 323.3 ms)
std dev              344.3 μs   (249.4 μs .. 442.1 μs)	      |	std dev              2.281 ms   (1.881 ms .. 2.797 ms)
variance introduced by outliers: 12% (moderately inflated)	variance introduced by outliers: 12% (moderately inflated)

benchmarking UDHR Greek						benchmarking UDHR Greek
time                 288.2 ms   (287.6 ms .. 288.7 ms)	      |	time                 37.62 ms   (37.28 ms .. 37.97 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)	      |	                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 288.2 ms   (287.9 ms .. 288.3 ms)	      |	mean                 38.03 ms   (37.87 ms .. 38.23 ms)
std dev              304.6 μs   (214.7 μs .. 409.7 μs)	      |	std dev              433.9 μs   (355.0 μs .. 564.2 μs)
variance introduced by outliers: 12% (moderately inflated)    <

benchmarking Emoji						benchmarking Emoji
time                 319.0 ms   (316.3 ms .. 321.7 ms)	      |	time                 334.2 ms   (332.5 ms .. 335.8 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)		                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 318.7 ms   (317.7 ms .. 319.7 ms)	      |	mean                 335.0 ms   (334.3 ms .. 336.2 ms)
std dev              1.548 ms   (1.202 ms .. 1.941 ms)	      |	std dev              1.364 ms   (917.4 μs .. 1.714 ms)
variance introduced by outliers: 12% (moderately inflated)	variance introduced by outliers: 12% (moderately inflated)

Benchmark doclayout-bench: FINISH				Benchmark doclayout-bench: FINISH
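
For reading the comparison above at a glance, the ratio of the two "time" columns can be computed with a small sketch like the following. The numbers are copied from a few of the rows above (all in ms); the names `timings`, `speedup`, and `report` are made up for this example.

```haskell
import Numeric (showFFloat)

-- (benchmark, time on master in ms, time on this branch in ms)
timings :: [(String, Double, Double)]
timings =
  [ ("UDHR French", 40.11, 26.23)
  , ("UDHR Vietnamese", 74.40, 26.59)
  , ("UDHR Hindi", 290.3, 36.75)
  , ("UDHR Korean", 202.2, 31.12)
  , ("UDHR Thai", 341.2, 323.1)
  , ("Emoji", 319.0, 334.2)
  ]

-- Values above 1 mean the branch is faster than master.
speedup :: Double -> Double -> Double
speedup before after = before / after

-- One formatted line per benchmark, ratio rounded to two decimals.
report :: [String]
report =
  [ name ++ ": " ++ showFFloat (Just 2) (speedup before after) "x"
  | (name, before, after) <- timings
  ]
```

Running `mapM_ putStrLn report` in GHCi prints the ratios. Hindi, for example, comes out at roughly 7.9x, while Thai and Emoji stay close to 1x.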

Xitian9 · 2021-10-23T10:55:28Z

Something to explore:

1. Why is Arabic still relatively slow, even with shortcuts? The text must use a lot of characters from a block without shortcuts.
2. Can the updateMatchNoShortcut code be improved by specialising to Narrow and Wide contexts?
3. Can we get rid of all special emoji handling in the basic multilingual plane? I think we might be most of the way there.

Xitian9 · 2021-10-23T11:16:38Z

Answer to question #1 above: it uses a lot of characters from Arabic Presentation Forms-B. There also exists Arabic Presentation Forms-A.

The text says that these are only used for compatibility with older codepages, and are not needed for coding text. I interpret this to mean there are old documents around using them, but few new documents will be created. I have no idea whether we should include shortcuts for these blocks.

A possible fix: choose a test file which uses the main block and not these presentation forms, and trust that's enough. Thoughts?
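
One way to check a candidate test file is to count how many of its characters fall in each block. The sketch below is only an illustration: the type and function names are invented here, and the ranges are the Unicode block boundaries for Arabic (U+0600 to U+06FF), Arabic Presentation Forms-A (U+FB50 to U+FDFF), and Arabic Presentation Forms-B (U+FE70 to U+FEFF).

```haskell
data ArabicBlock = MainBlock | FormsA | FormsB | Other
  deriving (Eq, Show)

-- Classify a character by the Unicode block it belongs to.
arabicBlock :: Char -> ArabicBlock
arabicBlock c
  | c >= '\x0600' && c <= '\x06FF' = MainBlock
  | c >= '\xFB50' && c <= '\xFDFF' = FormsA
  | c >= '\xFE70' && c <= '\xFEFF' = FormsB
  | otherwise = Other

-- Count the characters of a text per block.
countBlocks :: String -> [(ArabicBlock, Int)]
countBlocks s =
  [ (b, length (filter ((== b) . arabicBlock) s))
  | b <- [MainBlock, FormsA, FormsB, Other]
  ]
```

Applying `countBlocks` to the contents of a file (for instance via `readFile`) shows whether it is dominated by the main block or by the presentation forms.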

Xitian9 · 2021-10-23T11:33:22Z

If we change the Arabic benchmark to that suggested in #15, we get a more expected result:

benchmarking UDHR Arabic
time                 37.12 ms   (36.90 ms .. 37.29 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 37.88 ms   (37.61 ms .. 38.29 ms)
std dev              781.4 μs   (558.9 μs .. 1.053 ms)

Xitian9 · 2021-10-23T12:24:38Z

I feel like writing all these shortcuts by hand is something the computer should be able to do better than we can.

The basic multilingual plane consists of 16^4 points. But we don't need to know all of them, just the boundary markers where the width changes. If we restrict to a narrow context (i.e. resolve ambiguous characters to Narrow), then there are 431 such boundaries in the BMP. There are 722 in a wide context.
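
The boundary idea can be sketched as follows: walk the BMP once and keep only the code points where the width differs from the previous one. This is a hedged illustration, not the project's code. `boundaries` and `toyWidth` are hypothetical names, and `toyWidth` is a stand-in for a real width function, so the counts it produces say nothing about the 431 and 722 figures.

```haskell
-- Collapse the BMP into runs: each entry is the first code point of a run
-- together with the width shared by every code point in that run.
boundaries :: (Char -> Int) -> [(Char, Int)]
boundaries width = start [(c, width c) | c <- ['\x0000' .. '\xFFFF']]
  where
    start [] = []
    start (x : rest) = x : collapse (snd x) rest
    collapse _ [] = []
    collapse prev ((c, w) : rest)
      | w == prev = collapse prev rest       -- same run, nothing to record
      | otherwise = (c, w) : collapse w rest -- width changed: new boundary

-- Toy width function (an assumption for the example): CJK Unified Ideographs
-- are 2 columns, everything else is 1.
toyWidth :: Char -> Int
toyWidth c
  | c >= '\x4E00' && c <= '\x9FFF' = 2
  | otherwise = 1
```

With a real width function resolved for a narrow or a wide context, the resulting list is the table that a generated lookup would need.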

This means a binary search should be able to get the width with at most 9 comparisons in a narrow context, and 10 in a wide context. The current shortcuts only do better than that for ASCII in any context, for extended Latin in a Narrow context, and Han ideographs in a wide context.
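
A minimal sketch of such a lookup, assuming a sorted table of run starts held in a `Data.Array` (from the `array` package that ships with GHC). The names `Table`, `mkTable`, `lookupWidth`, and `maxComparisons` are made up for this example, and the table in the usage note is a toy, not real width data.

```haskell
import Data.Array (Array, bounds, listArray, (!))

-- Each entry: (first code point of a run, width of every code point in it).
type Table = Array Int (Char, Int)

mkTable :: [(Char, Int)] -> Table
mkTable xs = listArray (0, length xs - 1) xs

-- Binary search for the last entry whose start is <= c. Assumes the table is
-- non-empty, sorted by start, and that its first entry starts at '\x0000'.
lookupWidth :: Table -> Char -> Int
lookupWidth t c = go lo hi
  where
    (lo, hi) = bounds t
    go l h
      | l >= h = snd (t ! l)
      | fst (t ! m) <= c = go m h       -- answer is at m or to the right
      | otherwise = go l (m - 1)        -- answer is strictly left of m
      where
        m = (l + h + 1) `div` 2         -- upper midpoint, so the range always shrinks

-- Worst-case number of halving steps for n entries, i.e. ceiling (log2 n),
-- computed with integers to avoid floating-point rounding.
maxComparisons :: Int -> Int
maxComparisons n = length (takeWhile (< n) (iterate (* 2) 1))
```

For example, `lookupWidth (mkTable [('\x0000', 1), ('\x4E00', 2), ('\xA000', 1)]) '\x4E2D'` returns 2, and `maxComparisons` gives 9 for 431 entries and 10 for 722, matching the figures above.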

We seem to get much bigger slowdowns than that, and I suspect it has to do with the data types we're using not being easily optimised. Do you have any experience with this sort of optimisation?

jgm · 2021-10-23T18:28:38Z

Excellent! This is great. If you want to explore these further optimizations, it's always nice to do better, but I think the current performance is perfectly adequate. (You can see the "real world" effects by looking at the tabular benchmarks for Greek and English; although in realLength benchmarks English is significantly faster than Greek, this translates into a fairly small difference in the tabular reflow benchmark, so we may be looking at diminishing returns for further improvements.)

> I suspect it has to do with the data types we're using not being easily optimised. Do you have any experience with this sort of optimisation?

Which data types are you referring to specifically? I don't see anything in the data structures used by realLength that look suboptimal. Profiling could tell us more, though (both time and memory).

Generating the boundaries by computer seems smart, if you feel like working this out. And if we can get rid of special emoji handling, that will simplify the code which is definitely good.

Another thought: it could be worth considering whether the "real length" calculation is generally useful enough to abstract it out into a separate library, which doclayout could depend on. (But only if you feel like creating and maintaining such a library, of course!)

Xitian9 · 2021-10-23T21:44:56Z

I think the ‘real length’ calculation is definitely useful enough on its own. The entire reason I'm here is that we wanted to use it in hledger to improve our tabular layouts, and the best way of doing that seemed to be to pull doclayout in as a dependency in table-layout so it could do this properly. Emoji handling is because some people apparently like putting cheese in their financial reports.

It's also true that this would do well as a separate library. It would serve as a replacement for the unmaintained wcwidth library, which doesn't work on Windows for some reason.

That said, I have no experience creating/maintaining libraries. Maybe this is a good opportunity to learn.

jgm · 2021-10-23T22:24:50Z

If you want to create a separate library, you could use my emojis package as a model, since that's a very similar library in many ways.

Xitian9 · 2021-10-24T11:21:39Z

If I were to break it off into a separate library, would you prefer that I do that soon to avoid cluttering your commit history, or that I get the design improved here first and break it off once it's less likely to change (and require a separate release)?

jgm · 2021-10-24T15:07:54Z

It's okay with me to develop it here for now.

Xitian9 added 4 commits October 23, 2021 08:26

Add shortcuts for extended Latin, Arabic, Cyrillic, and Greek.

7b7648f

Add shortcuts for Devanagari and Bengali.

bca063a

Also make the shortcut tests more treelike, so we don't need to go through every test.

Add shortcuts for Korean, Telugu, and Tamil.

4c0e7e0

jgm merged commit 6ba3187 into jgm:master Oct 23, 2021

jgm merged 4 commits into jgm:master from Xitian9:speedup1