Conversation
I don't have any comments so far. I think the main thing is what you've mentioned: that this needs to be coordinated with the changes to mathicsscript and mathics-django, since this change breaks them. I'll look in detail once this isn't a draft.
…he same as its unicode equivalent to their plain text representation
If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.
For example:
```yaml
wl-to-unicode:
  AAcute:
    wl-unicode: "\xE1"
    standard-unicode: "\xE1"
    wl-long-name: "\[AAcute]"
    standard-unicode-name: LATIN SMALL LETTER A WITH ACUTE
    wl-unicode-name: LATIN SMALL LETTER A WITH ACUTE
```
Note that if there are missing items, that's okay.
For the more specific dictionaries or maps needed in Python, as before, a program can be used to do the conversion.
It might also be good to see if converting with Cython improves load/lookup time.
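As a rough sketch of what such a conversion program might look like, assuming the field names from the YAML layout above (the parsed YAML is stood in for by a plain dict here, so no YAML dependency is needed; the schema is an assumption, not the real one):

```python
import json

# Stand-in for the result of parsing the YAML master file; the field
# names mirror the YAML sketch above and are assumptions.
master = {
    "AAcute": {
        "wl-unicode": "\xe1",
        "standard-unicode": "\xe1",
        "wl-long-name": "\\[AAcute]",
    },
}

# Derive one of the "more specific dictionaries needed in Python":
# a WL-unicode -> long-name lookup table.
wl_to_long_name = {
    fields["wl-unicode"]: fields["wl-long-name"]
    for fields in master.values()
}

# Dump the derived table once so applications can load it directly.
blob = json.dumps(wl_to_long_name)
```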
I don't find this particularly useful and I've written entirely too many conversion scripts already 😁️, but feel free to implement this and let me know if there's a use case for it that I'm missing.
That's interesting, how would this work?
If you are going to create a YAML file (which I think is a great idea), then I think those comments should be put in dictionary items.
I don't find this particularly useful and I've written entirely too many conversion scripts already, but feel free to implement this and let me know if there's a use case for it that I'm missing.
I'll write the conversion scripts. Let's not merge this then, until that's done.
With all of this growth, I think that all of this should be removed from Core and turned into its own module which understands scanning, parsing, and WL symbols.
Other tools, like the pygments formatter and tools just for querying and converting Mathics, are likely to want this for debugging, reporting, or the features they provide. For example, a conversion tool where the user has the ability to specify turning all occurrences of LATIN SMALL LETTER A WITH ACUTE into "`a" because my tool is a TeX-like converter. And come to think of it, TeXForm may want to make use of the unicode-name aspect too.
Having useful data in a YAML file as a comment is just not as good as having it available for such tools.
It might also be good to see if converting with Cython improves load/lookup time.
That's interesting, how would this work?
The same way it works now for, say, numbers.c or pattern.c.
I am very concerned about the loading time as a result of all the overhead that has just been added. I am afraid that, for this niggling concern over hundreds of symbols that most people aren't using or interested in most of the time, we are adding maybe a second of extra startup time.
Some timing needs to be done before this goes in.
I suspect everything will be okay, though, if we drive this down to basically a load of a Cython file which has all of the definitions and regular-expression patterns.
mathics/core/characters.py
```diff
 # Load the data on characters
 with open(os.path.join(ROOT_DIR, "data/characters.yml"), "r") as f:
-    _CHAR_DATA = yaml.load(f)
+    _CHAR_DATA = yaml.load(f, Loader=yaml.FullLoader)
```
As suggested before, time this, and if this is slow speed it up
by doing the conversion and preprocessing once at install time.
Ohh, I see. Your idea is to load the YAML at install time and pickle the dictionary, or something like that?
- pickle
- cPickle
- json
- simplejson
- ujson
- yajl
This gives timings from slowest (pickle) to fastest (yajl). Since the code is there, we should try running it on our particular data to see which works best.
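A minimal stdlib-only version of such a timing harness (only pickle and json are compared here; ujson and the others would be drop-in replacements for the json calls, and all data and names are illustrative):

```python
import json
import pickle
import timeit

# Illustrative stand-in for the character tables; the real data is larger.
data = {"Char%04d" % i: {"wl-unicode": chr(0x00E1 + i % 64)} for i in range(1000)}

json_blob = json.dumps(data)
pickle_blob = pickle.dumps(data)

# Time repeated deserialization of the same blob.
t_json = timeit.timeit(lambda: json.loads(json_blob), number=200)
t_pickle = timeit.timeit(lambda: pickle.loads(pickle_blob), number=200)
print(f"json:   {t_json:.4f}s")
print(f"pickle: {t_pickle:.4f}s")
```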
Ok, I'm running the tests now. Apparently cPickle isn't used anymore, though; in Python 3 it was folded into pickle (see https://stackoverflow.com/questions/56562929/modulenotfounderror-no-module-named-cpickle-on-python-3-7-3/56563226#56563226)
No problem. I still need to fix some of the entries in the CSV tables. I'll let you know when I get those finished. Also, please add a
I think this is a great idea, but keep in mind there are good reasons for having this in Core:
I'll take a look at those when I get all of the tests to pass.
Surprisingly, the overhead doesn't look that big. I haven't measured it, and this is no excuse not to optimize the loading, but I haven't noticed a change. @rocky I'll work on the optimizations you proposed after I get the tests to pass, of course.
I agree, it's an easy fix overall (we just need to transfer the code from the
There's also a maintainability benefit I noticed while fixing the tests: by having everything in a single dictionary, as in

```yaml
ACup:
  wl: ...
  uni: ...
```

...we avoid duplicating data and make things easier to maintain.
I don't understand or follow this argument. Mathics Core imports tons of things, like mpmath or sympy. The fact that Mathics Core uses those imports isn't a reason to include the code for those imported modules inside Core. Python modules are split out from a larger body when the smaller part is found to be useful and beneficial on its own, without requiring the other parts. Here, the translation tables, scanning, and parsing form a useful entity on their own. If you want to use the scanning and parsing outside of Mathics, such as in a pygments formatter (or CSS formatter), then you do not need or want the evaluator, numpy, sympy, pint, and so on. It is possible or likely that things like TeXForm (SphinxForm, LaTeXForm, RsTForm, etc.) could live totally outside of Mathics Core. One thing that seems to come up a lot is that someone has data produced by WL that they'd like to export to some other format that WL doesn't provide.
mmatera left a comment
For me, it looks fine. Can we merge this and then split it up? Or would it be better to start the new package?
I prefer to start with a new package. There is too much complexity here for me to deal with. There is a lot more that needs to be done, in implementation and in testing, before I am comfortable and confident that this is safe, and those tests would be better done outside.
Ok, I tested the performance of multiple data serialization/deserialization libraries to see which one loads our data the fastest. Here are the preliminary results:
Here's the script I used for testing: https://pastebin.com/SpwwtDJH. I also wanted to test orjson but I couldn't install it via PyPI. As you can see, JSON is the clear winner overall and ujson is the most promising library. Apparently ujson is considered somewhat unsafe (see https://pythonspeed.com/articles/faster-json-library/), but we should be good since we never load untrusted data (we only load our own JSON files). There are some other things I'd like to test too, such as:
I've tested whether it's faster to load each table from a separate file or to load everything from a single big file (using ujson); here are the results:
I think the results are clear enough. Here is the script I used: https://pastebin.com/NjLddZDk
@GarkGarcia This is awesome! Thanks for doing this! (At work we are using ujson for exactly the same reason.)

So the overall approach I'd suggest is to start with the nicely formatted YAML, with the information arranged in a human-readable way for editing and reading. In fact, it doesn't have to be strict YAML: it could use https://pypi.org/project/yaml-include/ or one of the packages mentioned in https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another , if it is helpful to split the file up into smaller sections. Personally, I'd follow the same organization as is found on the WL site, whether one or many files.

From the YAML file, then preprocess the information into one or more ugly JSON files, each holding the set of tables needed by a particular application. For example, Mathics Core might use one of the JSON files, while a formatter might use a smaller or different JSON file. Each file is customized to its needs.

Each of these JSON files will undoubtedly have redundant information, both within a single file and between different files. Note this is in contrast to the YAML file (or the files making up one virtual YAML file). The redundancy here is because the JSON is organized for fast access by a particular application. For example, in mathicsscript there is a mapping from WL unicode to standard unicode and back. So these are two tables in JSON (or really two different dictionary keys in one big JSON dictionary), while in the YAML the information appears only once.

The practice of autoderiving multiple (redundant) copies (for speed) from a non-redundant source is sound practice for eliminating potential problems when adds, deletes, and updates can happen, as is the case here.
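A sketch of that autoderivation step under the assumptions above (the table names, code points, and schema are all illustrative, not the real Mathics data):

```python
import json

# Non-redundant master table, as it might come out of the YAML.
# The code points here are illustrative placeholders.
master = {
    "AAcute": {"wl-unicode": "\xe1", "standard-unicode": "\xe1"},
    "Rule": {"wl-unicode": "\uf522", "standard-unicode": "\u2192"},
}

# Redundant, application-oriented tables derived for fast lookup,
# one per direction:
wl_to_std = {v["wl-unicode"]: v["standard-unicode"] for v in master.values()}
std_to_wl = {std: wl for wl, std in wl_to_std.items()}

# One big JSON dictionary holding both directions, as described above.
blob = json.dumps({"wl-to-unicode": wl_to_std, "unicode-to-wl": std_to_wl})
tables = json.loads(blob)
print(tables["wl-to-unicode"]["\uf522"])  # prints "→"
```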
@GarkGarcia Where are the test.json files used in the tests?
I just did one more test (the one about the regexes). We essentially have two ways to deal with the regexes:
Here are the results of my tests:

load-pickled: 17.69692592299907

Again, the results are pretty clear. Here's the test script: https://pastebin.com/6weyTwzh. Notice I've hardcoded the regex source string. This is because, if we go with the first approach, it is assumed we have already loaded the rest of the information at this point (and loading the strings from JSON doesn't seem like much of an overhead, considering we will already be loading the massive tables).
Here are all the files I used in the tests: tests.tar.gz |
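A stdlib sketch of the two options being compared (the pattern and iteration counts are illustrative; note that Pattern objects pickle by their source string, so unpickling re-compiles anyway, and that re.compile has an internal cache which a fair measurement has to defeat):

```python
import pickle
import re
import timeit

# Illustrative stand-in for the named-character alternation source string.
src = "|".join(re.escape(c) for c in ("\xe1", "\u2192", "\u222b"))
pickled = pickle.dumps(re.compile(src))  # Pattern objects pickle by source

def compile_fresh():
    re.purge()  # defeat re's compile cache so each compile is real work
    return re.compile(src)

t_compile = timeit.timeit(compile_fresh, number=2000)
t_unpickle = timeit.timeit(lambda: pickle.loads(pickled), number=2000)
print(f"compile from source: {t_compile:.4f}s")
print(f"load pickled:        {t_unpickle:.4f}s")
```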
I absolutely agree. The tests show that precompiling the information to (ugly) JSON and loading it with ujson is clearly the best approach. I can get started implementing this tomorrow.
Yes, but
mathics/core/characters.py
```diff
@@ -23,20 +23,20 @@ def unicode_equivalent(k: str, v: dict):
 # Conversion from WL to the fully qualified names
 WL_TO_PLAIN_DICT = {re.escape(v["wl-unicode"]): f"\\[{k}]"
```
This is the kind of thing that can be computed once per install and put into a JSON table. And re_from would then work on a set or a list of keys rather than a dictionary.
This is the kind of thing that can be computed once per install and put into a JSON table. And re_from would then work on a set or a list of keys rather than a dictionary.
Yep, I plan to move this stuff to an install-time script after I get the tests to pass
Essentially, this is just a draft of the install-time script 😁️
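For concreteness, a toy version of what that install-time precomputation might look like (the WL_TO_PLAIN_DICT idea comes from the diff above; the data, code points, and surrounding names are assumptions):

```python
import json
import re

# Stand-in for the loaded character data; code points are illustrative.
char_data = {
    "AAcute": {"wl-unicode": "\xe1"},
    "Integral": {"wl-unicode": "\u222b"},
}

# Computed once, at install time, then dumped to JSON...
wl_to_plain = {v["wl-unicode"]: f"\\[{k}]" for k, v in char_data.items()}
pattern_src = "|".join(re.escape(c) for c in wl_to_plain)
blob = json.dumps({"wl-to-plain": wl_to_plain, "pattern": pattern_src})

# ...and at run time just loaded and used, with no escaping work left:
tables = json.loads(blob)
pat = re.compile(tables["pattern"])
out = pat.sub(lambda m: tables["wl-to-plain"][m.group(0)], "\xe1 + \u222b")
print(out)  # prints "\[AAcute] + \[Integral]"
```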
@rocky Alright, it looks like the tests are passing. Could you help me figure out where/how I should place the install script?
Sure - but I can't do this right now.
Depending on how my day job coding goes, in a day or so I'll copy all of this to a new repository and set that up. Thanks.
I've said it before but I'll say it again. Overall I think this is great work! @GarkGarcia
It addresses one of the dark areas of WL that we can bring light to. I think when this and the parser are put in a separate module there will be lots of uses of this outside of Mathics. I am pretty sure there are no other good and maintainable alternatives.
After this is
Sure - but I can't do this right now.
Depending on how my day job coding goes, in a day or so I'll copy all of this to a new repository and set that up. Thanks.
Great! I've set up the code in a way that makes it easy for you to extract it into a separate package: just copy and paste mathics/core/characters.py and move the section marked INITIALIZATION at the beginning into the installation script.
I've said it before but I'll say it again. Overall I think this is great work! @GarkGarcia
Thanks!
It addresses one of the dark areas of WL that we can bring light to. I think when this and the parser are put in a separate module there will be lots of uses of this outside of Mathics.
Yes, I can already see us using this module to generate the tables in the developer docs instead of the CSV files (the YAML file here has all of the information they have, plus some other stuff)
By avoiding escaping the keys of our dictionaries as much as possible, we make them more generic in a sense, which is good for the library
mathics.core.characters.aliased_characters was accidentally removed at some point
@rocky Regarding the name of the new library, I think "wl-named-character" is a good choice. I'd avoid adding "Mathics" to the name since, as you've commented, this is useful outside of Mathics too. Regarding API design, I think the interface could use some massaging to make it more general and more ergonomic.
@GarkGarcia @mmatera https://github.com/Mathics3/mathics-scanner is where this and the other scanner code has now been moved. Would appreciate it if in the future you try this out and make updates there. #1117 has a PR to remove the code that has been moved.
The code in here was transferred to https://github.com/Mathics3/mathics-scanner.
This is a follow-up to #1107. I've added a dictionary that maps the code points of named characters to their fully qualified names. Furthermore, I've renamed `replace_wl_with_unicode` to `replace_wl_with_plain_text` and added an optional parameter to it called `use_unicode` (it's `True` by default) that indicates to the function that named characters which have a unicode equivalent should be replaced with it (instead of the fully qualified name). If a named character doesn't have a unicode equivalent, or if `use_unicode` is set to `False`, then `replace_wl_with_plain_text` replaces it with its fully qualified name.
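A toy sketch of that behavior (not the actual Mathics implementation; the mini character table, its field names, and the code points are all hypothetical):

```python
import re

# Hypothetical mini character table: WL code point -> name plus an
# optional standard-unicode equivalent (None when there is none).
CHARS = {
    "\xe1": {"name": "AAcute", "unicode": "\xe1"},
    "\uf4a1": {"name": "Piecewise", "unicode": None},  # assumed: no equivalent
}
_PAT = re.compile("|".join(re.escape(c) for c in CHARS))

def replace_wl_with_plain_text(text: str, use_unicode: bool = True) -> str:
    """Replace WL named characters with their unicode equivalent when one
    exists and use_unicode is True; otherwise use the \\[Name] form."""
    def repl(m):
        info = CHARS[m.group(0)]
        if use_unicode and info["unicode"] is not None:
            return info["unicode"]
        return f"\\[{info['name']}]"
    return _PAT.sub(repl, text)

print(replace_wl_with_plain_text("\xe1\uf4a1"))                     # prints "á\[Piecewise]"
print(replace_wl_with_plain_text("\xe1\uf4a1", use_unicode=False))  # prints "\[AAcute]\[Piecewise]"
```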