
Named characters#1109

Closed
GarkGarcia wants to merge 32 commits into master from named-characters

Conversation

@GarkGarcia
Contributor

This is a follow-up to #1107. I've added a dictionary that maps the code points of named characters to their fully qualified names. Furthermore, I've renamed replace_wl_with_unicode to replace_wl_with_plain_text and added an optional parameter use_unicode (True by default) that tells the function to replace named characters that have a unicode equivalent with that equivalent (instead of the fully qualified name). If a named character doesn't have a unicode equivalent, or if use_unicode is set to False, then replace_wl_with_plain_text replaces it with its fully qualified name.
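A minimal sketch of the behavior described above, with illustrative tables standing in for the real dictionaries generated in this PR:

```python
# Illustrative tables; the real dictionaries are generated from the
# character data added in this PR.
WL_TO_UNICODE = {"\uf74c": "\u2146"}  # \[DifferentialD] -> U+2146
WL_TO_QUALIFIED_NAME = {"\uf74c": "\\[DifferentialD]"}


def replace_wl_with_plain_text(text: str, use_unicode: bool = True) -> str:
    """Replace WL named characters with their unicode equivalents when
    available (and use_unicode is True), else with fully qualified names."""
    out = []
    for char in text:
        if use_unicode and char in WL_TO_UNICODE:
            out.append(WL_TO_UNICODE[char])
        elif char in WL_TO_QUALIFIED_NAME:
            out.append(WL_TO_QUALIFIED_NAME[char])
        else:
            out.append(char)
    return "".join(out)
```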

@GarkGarcia GarkGarcia added the bug label Jan 13, 2021
@GarkGarcia GarkGarcia requested review from mmatera and rocky January 13, 2021 15:36
@GarkGarcia GarkGarcia marked this pull request as draft January 13, 2021 15:37
@rocky
Member

rocky commented Jan 14, 2021

I don't have any comments so far. I think the main thing is what you've mentioned - that this needs to be coordinated with the changes to mathicsscript and mathics-django, since this change breaks them.

I'll look in detail though when this isn't a draft.

…he same as its unicode equivalent to their plain text representation
Member

@rocky rocky left a comment

If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.

For example:

wl-to-unicode:
  AAcute:
    wl-unicode: "\xE1"
    standard-unicode: "\xE1"
    wl-long-name: "\[AAcute]"
    standard-unicode-name: LATIN SMALL LETTER A WITH ACUTE
    wl-unicode-name: LATIN SMALL LETTER A WITH ACUTE

Note that if there are missing items, that's okay.

For more specific dictionaries or maps needed in Python, a program can be used to make the conversion, as before.

It might also be good to see if converting with Cython improves load/lookup time.

@GarkGarcia
Contributor Author

If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.

I don't find this particularly useful, and I've written entirely too many conversion scripts already 😁️, but feel free to implement this and let me know if there's a use case for it that I'm missing.

It might also be good to see if converting with Cython improves load/lookup time.

That's interesting, how would this work?

Copy link
Member

@rocky rocky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.

I don't find this particularly useful, and I've written entirely too many conversion scripts already, but feel free to implement this and let me know if there's a use case for it that I'm missing.

I'll write the conversion scripts. Let's not merge this then, until that's done.

With all of this growth, I think that all of this should be removed from Core and turned into its own module which understands scanning, parsing, and WL symbols.

Other tools, like the pygments formatter and tools just for querying and converting Mathics, are likely to want this for debugging, reporting, or in the tools they provide. For example, a conversion tool where the user has the ability to specify turning all occurrences of LATIN SMALL LETTER A WITH ACUTE into "`a" because their tool is a TeX-like converter. And come to think of it, TeXForm may want to make use of the unicode-name aspect too.

Having useful data in a YAML file as a comment is just not as good as having it available for such tools.

It might also be good to see if converting with Cython improves load/lookup time.

That's interesting, how would this work?

The same way it works now for, say, numbers.c or pattern.c.

I am very concerned about the loading time as a result of all the overhead that has just been added. I am afraid that, for this niggling concern over hundreds of symbols that most people aren't using most of the time, we are adding maybe an extra second of startup time.

Some timing needs to be done before this goes in.

I suspect everything will be okay, though, if we drive this down to basically a load of a Cython file which has all of the definitions and regular-expression patterns.

# Load the data on characters
with open(os.path.join(ROOT_DIR, "data/characters.yml"), "r") as f:
    _CHAR_DATA = yaml.load(f, Loader=yaml.FullLoader)
Member

As suggested before, time this, and if it is slow, speed it up by doing the conversion and preprocessing once at install time.
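A sketch of what that install-time step could look like, assuming PyYAML and hypothetical file paths:

```python
import json

import yaml  # PyYAML, already used by this PR


def precompile_character_data(yaml_path: str, json_path: str) -> None:
    """Convert the human-edited YAML to JSON once, at install time."""
    with open(yaml_path, "r") as f:
        char_data = yaml.load(f, Loader=yaml.FullLoader)
    with open(json_path, "w") as f:
        json.dump(char_data, f)  # compact JSON loads much faster at runtime
```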

Contributor Author

Ohh, I see. Your idea is to load the YAML at install time and pickle the dictionary or something?

Member

@rocky rocky Jan 16, 2021

https://stackoverflow.com/questions/18517949/what-is-faster-loading-a-pickled-dictionary-object-or-loading-a-json-file-to

  • pickle
  • cPickle
  • json
  • simplejson
  • ujson
  • yajl

Gives timing from slowest (pickle) to fastest (yajl). Since the code is there, we should try running it on our particular data to see which works best.
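A minimal timing harness in this spirit (not the script from the linked question; the library set is trimmed to the stdlib, and the data is a fake table shaped roughly like the real one):

```python
import json
import pickle
import timeit

# Fake character data roughly shaped like the real tables.
data = {f"Char{i:04d}": {"wl-unicode": chr(0xE000 + i)} for i in range(1000)}
json_blob = json.dumps(data)
pickle_blob = pickle.dumps(data)

# Time deserialization only, which is what matters at startup.
json_time = timeit.timeit(lambda: json.loads(json_blob), number=1000)
pickle_time = timeit.timeit(lambda: pickle.loads(pickle_blob), number=1000)
print(f"json: {json_time:.3f}s  pickle: {pickle_time:.3f}s")
```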

Contributor Author

Ok, I'm running the tests now. Apparently cPickle isn't used anymore, though (see https://stackoverflow.com/questions/56562929/modulenotfounderror-no-module-named-cpickle-on-python-3-7-3/56563226#56563226)

@GarkGarcia
Contributor Author

Let's not merge this then, until that's done.

No problem. I still need to fix some of the entries in the CSV tables. I'll let you know when I get those finished. Also, please add an is-letter-like field to each entry of the dictionary.

With all of this growth, I think that all of this should be removed from Core and turned into its own module which understands scanning, parsing, and WL symbols.

I think this is a great idea, but keep in mind there are good reasons for having this in Core:

  1. It is actually needed in parts of the parser.
  2. It's needed to escape strings with named characters inside of FullForm (which is something I haven't implemented yet but I plan to add to this PR).

The same way it works now for say numbers.c, or pattern.c

I'll take a look at those when I get all of the tests to pass.

I am very concerned about the loading time as a result of all the overhead that has just been added. I am afraid that, for this niggling concern over hundreds of symbols that most people aren't using most of the time, we are adding maybe an extra second of startup time.

Surprisingly, the overhead doesn't look that big. I haven't measured it, and this is no excuse not to optimize the loading, but I haven't noticed a change. @rocky, I'll work on the optimizations you proposed after I get the tests to pass, of course.

Some timing needs to be done before this goes in.

I suspect everything will be okay, though, if we drive this down to basically a load of a Cython file which has all of the definitions and regular-expression patterns.

I agree, it's an easy fix overall (we just need to transfer the code from the ### INITIALIZATION ### section to something that runs at install time).

@GarkGarcia
Contributor Author

With all of this growth, I think that all of this should be removed from Core and turned into its own module which understands scanning, parsing, and WL symbols.

There's also a maintainability benefit I noticed while fixing the tests: by having everything in a single dictionary, like in

ACup:
  wl: ...
  uni: ...
  ...

we avoid duplication of data and make it easier to maintain.

@rocky
Member

rocky commented Jan 16, 2021

I think this is a great idea, but keep in mind there are good reasons for having this in Core:

  1. It is actually needed in parts of the parser.
  2. It's needed to escape strings with named characters inside of FullForm (which is something I haven't implemented yet but I plan to add to this PR).

I don't understand or follow this argument. Mathics Core imports tons of things, like mpmath or sympy. The fact that Mathics Core uses those imports isn't a reason to include the code for those imported modules inside Core. The python import mechanism works just fine.

Modules are split out from a larger body when it is found that the smaller part has a function that is useful and beneficial and doesn't require the other parts.

Here, the translation tables, scanning, and parsing are a useful entity on their own. If you want to use the scanning and parsing outside of Mathics, such as in a pygments formatter (or CSS formatter), then you do not need or want the evaluator, numpy, sympy, pint, and so on. It is possible or likely that things like TeXForm (SphinxForm, LaTeXForm, RsTForm, etc.) could live totally outside of Mathics core. One thing that seems to come up a lot is that someone has data produced by WL that they'd like to export to some other format that WL doesn't provide.

Contributor

@mmatera mmatera left a comment

For me, it looks fine. Can we merge this and then split this up? Or would it be better to start the new package?

@rocky
Member

rocky commented Jan 16, 2021

I prefer to start with a new package. There is too much complexity here for me to deal with.

There is a lot more that needs to be done, including testing, before I am comfortable and confident that this is safe, and those tests would be better done outside.

@GarkGarcia
Contributor Author

Ok, I tested the performance of multiple data serialization/deserialization libraries to see which one loads our data the fastest. Here are the preliminary results:

  • json: 575.8368612159975
  • simplejson: 553.253133805003
  • ujson: 465.9347666499998
  • rapidjson: 630.6619631499998
  • PyYAML: I left it running for like 3 hours and it didn't finish (?)
  • yajl: 820.8111407120014
  • pickle: 547.3880177629981

Here's the script I used for testing: https://pastebin.com/SpwwtDJH. I also wanted to test orjson, but I couldn't install it via PyPI. As you can see, JSON is the clear winner overall, and ujson is the most promising library. Apparently ujson is considered somewhat unsafe (see https://pythonspeed.com/articles/faster-json-library/), but we should be good since we never load untrusted data (we only load our own JSON files).

There are some other things I'd like to test too, such as:

  1. Is it faster to load everything from one big file or to load each dictionary separately from smaller files?
  2. Is it faster to compile the regexes at runtime or to load them from disk with pickle?

@GarkGarcia
Contributor Author

I've tested whether it's faster to load each table from a separate file or to load everything from a single big file (using ujson); here are the results:

  • big-file: 22.480467636000867
  • separate-files: 850.3780419389996

I think the results are clear enough. Here is the script I used: https://pastebin.com/NjLddZDk

@rocky
Member

rocky commented Jan 17, 2021

@GarkGarcia This is awesome! Thanks for doing this!

(At work we are using ujson for exactly the same reason)

So the overall approach I'd suggest is to start with the nicely formatted YAML, with the information arranged in a nice human-readable way for editing and reading. In fact, it doesn't have to be strict YAML - it could use https://pypi.org/project/yaml-include/ or one of the packages mentioned in https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another if it is helpful to split the file up into smaller sections. Personally, I'd follow the same organization as is found on the WL site, whether one or many files.

From the YAML file, then preprocess the information into one or more ugly JSON files which have the set of tables needed for each application. For example, Mathics Core might use one of the JSON files, but a formatter might use a smaller or different JSON file. Each file is customized to its needs.

Each of these JSON files will undoubtedly have redundant information, both within a single file and between different files. Note this is in contrast to the YAML file (or the files making up one virtual YAML file).

The redundancy here is because the JSON is organized for fast access for a particular application. For example, in mathicsscript there is a mapping from WL unicode to standard unicode and back. So these are two tables in JSON (or really two different dictionary keys in one big JSON dictionary), while in the YAML the information appears only once.

The practice of autoderiving multiple (redundant) copies (for speed) from a non-redundant source is a sound practice for eliminating potential problems when adds, deletes, and updates can happen, as is the case here.
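A small sketch of that derivation, with illustrative entries standing in for the real YAML data:

```python
# Illustrative master entries; the real non-redundant source is the YAML file.
master = {
    "AAcute": {"wl-unicode": "\xe1", "standard-unicode": "\xe1"},
    "DifferentialD": {"wl-unicode": "\uf74c", "standard-unicode": "\u2146"},
}

# Two derived, redundant tables: a forward map and its inverse.  Both are
# regenerated from `master` whenever an entry is added, deleted, or updated,
# so they can never drift out of sync with the source.
wl_to_unicode = {v["wl-unicode"]: v["standard-unicode"] for v in master.values()}
unicode_to_wl = {v["standard-unicode"]: v["wl-unicode"] for v in master.values()}
```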

@rocky
Member

rocky commented Jan 17, 2021

@GarkGarcia Where are the test.json files used in the tests?

@GarkGarcia
Contributor Author

I just did one more test (the one about the regexes). We essentially have two ways to deal with the regexes:

  1. Store the source strings in JSON/YAML/pickle on disk, read them when loading the rest of the information, and compile them at runtime (that is, when the module is loaded).
  2. Pickle them and load them from disk.

Here are the results of my tests:

  • load-pickled: 17.69692592299907
  • compile-at-runtime: 0.3634936589987774

Again, the results are pretty clear. Here's the test script: https://pastebin.com/6weyTwzh.

Notice I've hardcoded the regex source string. This is because, if we go with the first approach, it is assumed we have already loaded the rest of the information at this point (and loading the strings from JSON doesn't seem like much of an overhead considering we will already be loading the massive tables).
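The direction of the result makes sense: a pickled compiled pattern stores only its source string and flags, and unpickling recompiles it through re.compile anyway, so loading a pickle can't beat compiling directly. A toy sketch of the comparison (hypothetical pattern, not the one from the script):

```python
import pickle
import re
import timeit

source = r"\\\[([A-Za-z]+)\]"  # toy stand-in for the named-character regex
blob = pickle.dumps(re.compile(source))

# Compare compiling from source against unpickling a compiled pattern.
compile_time = timeit.timeit(lambda: re.compile(source), number=1000)
unpickle_time = timeit.timeit(lambda: pickle.loads(blob), number=1000)
print(f"compile: {compile_time:.4f}s  unpickle: {unpickle_time:.4f}s")
```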

@GarkGarcia
Contributor Author

@GarkGarcia Where are the test.json files used in the tests?

Here are all the files I used in the tests: tests.tar.gz

@GarkGarcia
Contributor Author

@GarkGarcia This is awesome! Thanks for doing this!

(At work we are using ujson for exactly the same reason)


I absolutely agree. The tests show that precompiling the information to (ugly) JSON and loading it with ujson is clearly the best approach. I can get started on implementing this tomorrow.

Note this is in contrast to the YAML file (or files making up a one virtual YAML file).

Yes, but unicode-to-wl would still have to be kept as a separate table (because the default table isn't fully invertible). All of the rest of the information can be derived from the main table at install time, precompiled, and stored as JSON on disk for fast access.

@@ -23,20 +23,20 @@ def unicode_equivalent(k: str, v: dict):
# Conversion from WL to the fully qualified names
WL_TO_PLAIN_DICT = {re.escape(v["wl-unicode"]): f"\\[{k}]"
Member

@rocky rocky Jan 17, 2021

This is the kind of thing that can be computed once per install and put into a JSON table. And re_from keys would then work on a set or a list rather than a dictionary.

Contributor Author

This is the kind of thing that can be computed once per install and put into a JSON table. And re_from keys would then work on a set or a list rather than a dictionary.

Yep, I plan to move this stuff to an install-time script after I get the tests to pass.

Contributor Author

Essentially, this is just a draft of the install-time script 😁️

Contributor Author

@rocky Alright, it looks like the tests are passing. Could you help me figure out where/how I should place the install script?

Member

Sure - but I can't do this right now.

Depending on how my day-job coding goes, in a day or so I'll copy all of this to a new repository and set that up. Thanks.

Member

I've said it before but I'll say it again. Overall I think this is great work! @GarkGarcia

It addresses one of the dark areas of WL that we can bring light to. I think when this and the parser are put in a separate module there will be lots of uses for this outside of Mathics. I am pretty sure there are no other good and maintainable alternatives.

After this is

Contributor Author

Sure - but I can't do this right now.

Depending on how my day-job coding goes, in a day or so I'll copy all of this to a new repository and set that up. Thanks.

Great! I've set up the code in a way that makes it easy for you to extract it to a separate package: just copy and paste mathics/core/characters.py and transfer the section marked INITIALIZATION at the beginning to the installation script.

I've said it before but I'll say it again. Overall I think this is great work! @GarkGarcia

Thanks!

It addresses one of the dark areas of WL that we can bring light to. I think when this and the parser are put in a separate module there will be lots of uses for this outside of Mathics.

Yes, I can already see us using this module to generate the tables in the developer docs instead of the CSV files (the YAML file here has all of the information they have, plus some other stuff).

@GarkGarcia
Contributor Author

@rocky Regarding the name of the new library, I think "wl-named-character" is a good choice. I'd avoid adding "Mathics" to the name since, as you've commented, this is useful outside of Mathics too.

Regarding API design, I think the interface could use some massaging to make it more general and more ergonomic. wl_to_plain_text, wl_to_unicode, and unicode_to_wl are needed for replace_wl_with_plain_text and replace_unicode_with_wl, so I'd keep those around. letterlikes could be replaced with a function like is_letter_like, and aliased_characters could be replaced with a function get_esc_alias (which takes WL unicode and returns the ESC alias if it exists).
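A hedged sketch of that proposed interface; the tables here are toy stand-ins for the real character data:

```python
from typing import Optional

# Toy stand-ins for the real character tables.
LETTERLIKES = frozenset({"\u03b1", "\u03b2"})  # e.g. Greek alpha, beta
ESC_ALIASES = {"\u03b1": "a", "\u2208": "el"}  # WL unicode -> ESC alias


def is_letter_like(char: str) -> bool:
    """True if the character counts as letter-like in WL's sense."""
    return char in LETTERLIKES


def get_esc_alias(char: str) -> Optional[str]:
    """Return the ESC alias of a WL character, or None if it has none."""
    return ESC_ALIASES.get(char)
```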

@rocky
Member

rocky commented Jan 18, 2021

@GarkGarcia @mmatera https://github.com/Mathics3/mathics-scanner is where this and the other scanner code have now been moved. Would appreciate it if in the future you try this and update there. #1117 has a PR to remove the code that has been moved.

@GarkGarcia
Contributor Author

The code in here was transferred to https://github.com/Mathics3/mathics-scanner.

@GarkGarcia GarkGarcia closed this Jan 19, 2021