Skip to content

unicodedata: is_normalized claims nothing is normalized in any form when using the 3.2.0 database #101372

@zahlman

Description

@zahlman

Bug report

3.8 adds the .is_normalized function to the unicodedata module, which also is available as a method on the legacy unicodedata.ucd_3_2_0 database. It is supposed to check whether a string is equal to its normalization in a given form, but without having to normalize and compare.

However, the legacy version does not maintain the expected invariant. In fact, it reports that every single-character string is not normalized, regardless of the normalization form chosen. Presumably, the result is the same for every non-empty string. (It appears that the empty string works because it is special-cased at line 871-874.)

Example:

>>> import unicodedata
>>> unicodedata.ucd_3_2_0.normalize('NFC', '!') == '!'
True
>>> unicodedata.ucd_3_2_0.is_normalized('NFC', '!')
False
>>> any(unicodedata.ucd_3_2_0.is_normalized(form, chr(x)) for form in ('NFC', 'NFD', 'NFKC', 'NFKD') for x in range(0x110000))
False

The bug appears to be at line 801-804 of unicodedata.c:

    /* UCD 3.2.0 is requested, quickchecks must be disabled. */
    if (UCD_Check(self)) {
        return NO;
    }

I believe the NO should say MAYBE instead. The NO value appears to indicate that the quickcheck has determined that the string is not normalized - contrary to both the comment and expected behaviour.

Your environment

$ python
Python 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions