
Named characters#1109

Closed
GarkGarcia wants to merge 32 commits into master from named-characters

Conversation

@GarkGarcia
Contributor

This is a follow-up to #1107. I've added a dictionary that maps the code points of named characters to their fully qualified names. Furthermore, I've renamed replace_wl_with_unicode to replace_wl_with_plain_text and added an optional parameter use_unicode (True by default) that tells the function to replace named characters that have a unicode equivalent with that equivalent (instead of the fully qualified name). If a named character doesn't have a unicode equivalent, or if use_unicode is set to False, then replace_wl_with_plain_text replaces it with its fully qualified name.
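A minimal sketch of the behavior described above, with illustrative tables standing in for the real dictionaries generated in this PR:

```python
# Illustrative tables; the real dictionaries are generated from the
# character data added in this PR.
WL_TO_UNICODE = {"\uf74c": "\u2146"}  # \[DifferentialD] -> U+2146
WL_TO_QUALIFIED_NAME = {"\uf74c": "\\[DifferentialD]"}


def replace_wl_with_plain_text(text: str, use_unicode: bool = True) -> str:
    """Replace WL named characters with their unicode equivalents when
    available (and use_unicode is True), else with fully qualified names."""
    out = []
    for char in text:
        if use_unicode and char in WL_TO_UNICODE:
            out.append(WL_TO_UNICODE[char])
        elif char in WL_TO_QUALIFIED_NAME:
            out.append(WL_TO_QUALIFIED_NAME[char])
        else:
            out.append(char)
    return "".join(out)
```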

@GarkGarcia GarkGarcia added the bug label Jan 13, 2021
@GarkGarcia GarkGarcia requested review from mmatera and rocky January 13, 2021 15:36
@GarkGarcia GarkGarcia marked this pull request as draft January 13, 2021 15:37
@rocky
Member

rocky commented Jan 14, 2021

I don't have any comments so far. I think the main thing is what you've mentioned - that this needs to be coordinated with the changes to mathicsscript and mathics-django, since this change breaks them.

I'll look in detail though when this isn't a draft.

…he same as its unicode equivalent to their plain text representation
Member

@rocky rocky left a comment

If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.

For example:

wl-to-unicode:
  AAcute:
    wl-unicode: "\xE1"
    standard-unicode: "\xE1"
    wl-long-name: "\[AAcute]"
    standard-unicode-name: LATIN SMALL LETTER A WITH ACUTE
    wl-unicode-name: LATIN SMALL LETTER A WITH ACUTE

Note that if there are missing items, that's okay.

For more specific dictionaries or maps needed in Python, a program can be used to make the conversion, as before.

It might also be good to see if converting with Cython improves load/lookup time.

@GarkGarcia
Contributor Author

If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.

I don't find this particularly useful, and I've written entirely too many conversion scripts already 😁️, but feel free to implement this and let me know if there's a use case for it that I'm missing.

It might also be good to see if converting with Cython improves load/lookup time.

That's interesting, how would this work?

Copy link
Member

@rocky rocky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.

I don't find this particularly useful, and I've written entirely too many conversion scripts already, but feel free to implement this and let me know if there's a use case for it that I'm missing.

I'll write the conversion scripts. Let's not merge this then, until that's done.

With all of this growth, I think that all of this should be removed from Core and turned into its own module which understands scanning, parsing, and WL symbols.

Other tools, like the pygments formatter and tools just for querying and converting Mathics, are likely to want this for debugging, reporting, or in the tools they provide. For example, a conversion tool where the user has the ability to specify turning all occurrences of LATIN SMALL LETTER A WITH ACUTE into "`a" because their tool is a TeX-like converter. And come to think of it, TeXForm may want to make use of the unicode-name aspect too.

Having useful data in a YAML file as a comment is just not as good as having it available for such tools.

It might also be good to see if converting with Cython improves load/lookup time.

That's interesting, how would this work?

The same way it works now for, say, numbers.c or pattern.c.

I am very concerned about the loading time as a result of all the overhead that has just been added. I am afraid that, for this niggling concern over hundreds of symbols that most people aren't using most of the time, we are adding maybe an extra second of startup time.

Some timing needs to be done before this goes in.

I suspect everything will be okay, though, if we drive this down to basically a load of a Cython file which has all of the definitions and regular-expression patterns.

# Load the data on characters
with open(os.path.join(ROOT_DIR, "data/characters.yml"), "r") as f:
    _CHAR_DATA = yaml.load(f, Loader=yaml.FullLoader)
Member

As suggested before, time this, and if it is slow, speed it up by doing the conversion and preprocessing once at install time.
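A sketch of what that install-time step could look like, assuming PyYAML and hypothetical file paths:

```python
import json

import yaml  # PyYAML, already used by this PR


def precompile_character_data(yaml_path: str, json_path: str) -> None:
    """Convert the human-edited YAML to JSON once, at install time."""
    with open(yaml_path, "r") as f:
        char_data = yaml.load(f, Loader=yaml.FullLoader)
    with open(json_path, "w") as f:
        json.dump(char_data, f)  # compact JSON loads much faster at runtime
```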

Contributor Author

Ohh, I see. Your idea is to load the YAML at install time and pickle the dictionary or something?

Member

@rocky rocky Jan 16, 2021

https://stackoverflow.com/questions/18517949/what-is-faster-loading-a-pickled-dictionary-object-or-loading-a-json-file-to

  • pickle
  • cPickle
  • json
  • simplejson
  • ujson
  • yajl

Gives timing from slowest (pickle) to fastest (yajl). Since the code is there, we should try running it on our particular data to see which works best.
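A minimal timing harness in this spirit (not the script from the linked question; the library set is trimmed to the stdlib, and the data is a fake table shaped roughly like the real one):

```python
import json
import pickle
import timeit

# Fake character data roughly shaped like the real tables.
data = {f"Char{i:04d}": {"wl-unicode": chr(0xE000 + i)} for i in range(1000)}
json_blob = json.dumps(data)
pickle_blob = pickle.dumps(data)

# Time deserialization only, which is what matters at startup.
json_time = timeit.timeit(lambda: json.loads(json_blob), number=1000)
pickle_time = timeit.timeit(lambda: pickle.loads(pickle_blob), number=1000)
print(f"json: {json_time:.3f}s  pickle: {pickle_time:.3f}s")
```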

Contributor Author

Ok, I'm running the tests now. Apparently cPickle isn't used anymore, though (see https://stackoverflow.com/questions/56562929/modulenotfounderror-no-module-named-cpickle-on-python-3-7-3/56563226#56563226)

@GarkGarcia
Contributor Author

Let's not merge this then, until that's done.

No problem. I still need to fix some of the entries in the CSV tables. I'll let you know when I get those finished. Also, please add an is-letter-like field to each entry of the dictionary.

With all of this growth, I think that all of this should be removed from Core and turned into its own module which understands scanning, parsing, and WL symbols.

I think this is a great idea, but keep in mind there are good reasons for having this in Core:

  1. It is actually needed in parts of the parser.
  2. It's needed to escape strings with named characters inside of FullForm (which is something I haven't implemented yet but I plan to add to this PR).

The same way it works now for say numbers.c, or pattern.c

I'll take a look at those when I get all of the tests to pass.

I am very concerned about the loading time as a result of all the overhead that has just been added. I am afraid that, for this niggling concern over hundreds of symbols that most people aren't using most of the time, we are adding maybe an extra second of startup time.

Surprisingly, the overhead doesn't look that big. I haven't measured it, and this is no excuse not to optimize the loading, but I haven't noticed a change. @rocky, I'll work on the optimizations you proposed after I get the tests to pass, of course.

Some timing needs to be done before this goes in.

I suspect everything will be okay, though, if we drive this down to basically a load of a Cython file which has all of the definitions and regular-expression patterns.

I agree, it's an easy fix overall (we just need to transfer the code from the ### INITIALIZATION ### section to something that runs at install time).

@GarkGarcia
Contributor Author

With all of this growth, I think that all of this should be removed from Core and turned into its own module which understands scanning, parsing, and WL symbols.

There's also a maintainability benefit I noticed while fixing the tests: by having everything in a single dictionary, like in

ACup:
  wl: ...
  uni: ...
  ...

we avoid duplication of data and make it easier to maintain.

@rocky
Member

rocky commented Jan 16, 2021

I think this is a great idea, but keep in mind there are good reasons for having this in Core:

  1. It is actually needed in parts of the parser.
  2. It's needed to escape strings with named characters inside of FullForm (which is something I haven't implemented yet but I plan to add to this PR).

I don't understand or follow this argument. Mathics Core imports tons of things, like mpmath or sympy. The fact that Mathics Core uses those imports isn't a reason to include the code for those imported modules inside Core. The python import mechanism works just fine.

Modules are split out from a larger body when it is found that the smaller part has a function that is useful and beneficial and doesn't require the other parts.

Here, the translation tables, scanning, and parsing are a useful entity on their own. If you want to use the scanning and parsing outside of Mathics, such as in a pygments formatter (or CSS formatter), then you do not need or want the evaluator, numpy, sympy, pint, and so on. It is possible or likely that things like TeXForm (SphinxForm, LaTeXForm, RsTForm, etc.) could live totally outside of Mathics core. One thing that seems to come up a lot is that someone has data produced by WL that they'd like to export to some other format that WL doesn't provide.

Contributor

@mmatera mmatera left a comment

For me, it looks fine. Can we merge this and then split this up? Or would it be better to start the new package?

@rocky
Member

rocky commented Jan 16, 2021

I prefer to start with a new package. There is too much complexity here for me to deal with.

There is a lot more that needs to be done, including testing, before I am comfortable and confident that this is safe, and those tests would be better done outside.

@GarkGarcia
Contributor Author

Ok, I tested the performance of multiple data serialization/deserialization libraries to see which one loads our data the fastest. Here are the preliminary results:

  • json: 575.8368612159975
  • simplejson: 553.253133805003
  • ujson: 465.9347666499998
  • rapidjson: 630.6619631499998
  • PyYAML: I left it running for like 3 hours and it didn't finish (?)
  • yajl: 820.8111407120014
  • pickle: 547.3880177629981

Here's the script I used for testing: https://pastebin.com/SpwwtDJH. I also wanted to test orjson, but I couldn't install it via PyPI. As you can see, JSON is the clear winner overall, and ujson is the most promising library. Apparently ujson is considered somewhat unsafe (see https://pythonspeed.com/articles/faster-json-library/), but we should be good since we never load untrusted data (we only load our own JSON files).

There are some other things I'd like to test too, such as:

  1. Is it faster to load everything from one big file or to load each dictionary separately from smaller files?
  2. Is it faster to compile the regexes at runtime or to load them from disk with pickle?

@GarkGarcia
Contributor Author

I've tested whether it's faster to load each table from a separate file or to load everything from a single big file (using ujson); here are the results:

  • big-file: 22.480467636000867
  • separate-files: 850.3780419389996

I think the results are clear enough. Here is the script I used: https://pastebin.com/NjLddZDk

@rocky
Member

rocky commented Jan 17, 2021

@GarkGarcia This is awesome! Thanks for doing this!

(At work we are using ujson for exactly the same reason)

So the overall approach I'd suggest is to start with the nicely formatted YAML, with the information arranged in a nice human-readable way for editing and reading. In fact, it doesn't have to be strict YAML - it could use https://pypi.org/project/yaml-include/ or one of the packages mentioned in https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another if it is helpful to split the file up into smaller sections. Personally, I'd follow the same organization as is found on the WL site, whether one or many files.

From the YAML file, then preprocess the information into one or more ugly JSON files which have the set of tables needed for each application. For example, Mathics Core might use one of the JSON files, but a formatter might use a smaller or different JSON file. Each file is customized to its needs.

Each of these JSON files will undoubtedly have redundant information, both within a single file and between different files. Note this is in contrast to the YAML file (or the files making up one virtual YAML file).

The redundancy here is because the JSON is organized for fast access for a particular application. For example, in mathicsscript there is a mapping from WL unicode to standard unicode and back. So these are two tables in JSON (or really two different dictionary keys in one big JSON dictionary), while in the YAML the information appears only once.

The practice of autoderiving multiple (redundant) copies (for speed) from a non-redundant source is a sound practice for eliminating potential problems when adds, deletes, and updates can happen, as is the case here.
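A small sketch of that derivation, with illustrative entries standing in for the real YAML data:

```python
# Illustrative master entries; the real non-redundant source is the YAML file.
master = {
    "AAcute": {"wl-unicode": "\xe1", "standard-unicode": "\xe1"},
    "DifferentialD": {"wl-unicode": "\uf74c", "standard-unicode": "\u2146"},
}

# Two derived, redundant tables: a forward map and its inverse.  Both are
# regenerated from `master` whenever an entry is added, deleted, or updated,
# so they can never drift out of sync with the source.
wl_to_unicode = {v["wl-unicode"]: v["standard-unicode"] for v in master.values()}
unicode_to_wl = {v["standard-unicode"]: v["wl-unicode"] for v in master.values()}
```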

@rocky
Member

rocky commented Jan 17, 2021

@GarkGarcia Where are the test.json files used in the tests?

@GarkGarcia
Contributor Author

I just did one more test (the one about the regexes). We essentially have two ways to deal with the regexes:

  1. Store the source strings in JSON/YAML/pickle on disk, read them when loading the rest of the information, and compile them at runtime (that is, when the module is loaded).
  2. Pickle them and load them from disk.

Here are the results of my tests:

  • load-pickled: 17.69692592299907
  • compile-at-runtime: 0.3634936589987774

Again, the results are pretty clear. Here's the test script: https://pastebin.com/6weyTwzh.

Notice I've hardcoded the regex source string. This is because, if we go with the first approach, it is assumed we have already loaded the rest of the information at this point (and loading the strings from JSON doesn't seem like much of an overhead considering we will already be loading the massive tables).
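The direction of the result makes sense: a pickled compiled pattern stores only its source string and flags, and unpickling recompiles it through re.compile anyway, so loading a pickle can't beat compiling directly. A toy sketch of the comparison (hypothetical pattern, not the one from the script):

```python
import pickle
import re
import timeit

source = r"\\\[([A-Za-z]+)\]"  # toy stand-in for the named-character regex
blob = pickle.dumps(re.compile(source))

# Compare compiling from source against unpickling a compiled pattern.
compile_time = timeit.timeit(lambda: re.compile(source), number=1000)
unpickle_time = timeit.timeit(lambda: pickle.loads(blob), number=1000)
print(f"compile: {compile_time:.4f}s  unpickle: {unpickle_time:.4f}s")
```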

@GarkGarcia
Contributor Author

@GarkGarcia Where are the test.json files used in the tests?

Here are all the files I used in the tests: tests.tar.gz

@GarkGarcia
Contributor Author

@GarkGarcia This is awesome! Thanks for doing this!

(At work we are using ujson for exactly the same reason)


I absolutely agree. The tests show that precompiling the information to (ugly) JSON and loading it with ujson is clearly the best approach. I can get started on implementing this tomorrow.

Note this is in contrast to the YAML file (or files making up a one virtual YAML file).

Yes, but unicode-to-wl would still have to be kept as a separate table (because the default table isn't fully invertible). All of the rest of the information can be derived from the main table at install time, precompiled, and stored as JSON on disk for fast access.

@@ -23,20 +23,20 @@ def unicode_equivalent(k: str, v: dict):
# Conversion from WL to the fully qualified names
WL_TO_PLAIN_DICT = {re.escape(v["wl-unicode"]): f"\\[{k}]"
Member

@rocky rocky Jan 17, 2021

This is the kind of thing that can be computed once per install and put into a JSON table. And re_from keys would then work on a set or a list rather than a dictionary.

Contributor Author

This is the kind of thing that can be computed once per install and put into a JSON table. And re_from keys would then work on a set or a list rather than a dictionary.

Yep, I plan to move this stuff to an install-time script after I get the tests to pass.

Contributor Author

Essentially, this is just a draft of the install-time script 😁️

Contributor Author

@rocky Alright, it looks like the tests are passing. Could you help me figure out where/how I should place the install script?

Member

Sure - but I can't do this right now.

Depending on how my day-job coding goes, in a day or so I'll copy all of this to a new repository and set that up. Thanks.

Member

I've said it before but I'll say it again. Overall I think this is great work! @GarkGarcia

It addresses one of the dark areas of WL that we can bring light to. I think when this and the parser are put in a separate module there will be lots of uses for this outside of Mathics. I am pretty sure there are no other good and maintainable alternatives.

After this is

Contributor Author

Sure - but I can't do this right now.

Depending on how my day-job coding goes, in a day or so I'll copy all of this to a new repository and set that up. Thanks.

Great! I've set up the code in a way that makes it easy for you to extract it to a separate package: just copy and paste mathics/core/characters.py and transfer the section marked INITIALIZATION at the beginning to the installation script.

I've said it before but I'll say it again. Overall I think this is great work! @GarkGarcia

Thanks!

It addresses one of the dark areas of WL that we can bring light to. I think when this and the parser are put in a separate module there will be lots of uses for this outside of Mathics.

Yes, I can already see us using this module to generate the tables in the developer docs instead of the CSV files (the YAML file here has all of the information they have, plus some other stuff).

@GarkGarcia
Contributor Author

@rocky Regarding the name of the new library, I think "wl-named-character" is a good choice. I'd avoid adding "Mathics" to the name since, as you've commented, this is useful outside of Mathics too.

Regarding API design, I think the interface could use some massaging to make it more general and more ergonomic. wl_to_plain_text, wl_to_unicode, and unicode_to_wl are needed for replace_wl_with_plain_text and replace_unicode_with_wl, so I'd keep those around. letterlikes could be replaced with a function like is_letter_like, and aliased_characters could be replaced with a function get_esc_alias (which takes WL unicode and returns the ESC alias if it exists).
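A hedged sketch of that proposed interface; the tables here are toy stand-ins for the real character data:

```python
from typing import Optional

# Toy stand-ins for the real character tables.
LETTERLIKES = frozenset({"\u03b1", "\u03b2"})  # e.g. Greek alpha, beta
ESC_ALIASES = {"\u03b1": "a", "\u2208": "el"}  # WL unicode -> ESC alias


def is_letter_like(char: str) -> bool:
    """True if the character counts as letter-like in WL's sense."""
    return char in LETTERLIKES


def get_esc_alias(char: str) -> Optional[str]:
    """Return the ESC alias of a WL character, or None if it has none."""
    return ESC_ALIASES.get(char)
```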

@rocky
Member

rocky commented Jan 18, 2021

@GarkGarcia @mmatera https://github.com/Mathics3/mathics-scanner is where this and the other scanner code have now been moved. Would appreciate it if in the future you try this and update there. #1117 has a PR to remove the code that has been moved.

@GarkGarcia
Contributor Author

The code in here was transferred to https://github.com/Mathics3/mathics-scanner.

@GarkGarcia GarkGarcia closed this Jan 19, 2021