
Type Annotations #320

Merged
ofajardo merged 39 commits into Roche:pyfile_dev from nachomaiz:pyfile_typehints
Apr 10, 2026


@nachomaiz
Contributor

@nachomaiz nachomaiz commented Feb 13, 2026

Hi @ofajardo!

This PR aims to fix #299, adding type annotations to all public interface functions and classes.

I based them on the docstrings and how I understand the code is operating with the different parameters and class attributes, but I might have missed something.

I wasn't able to compile the library on this machine; however, I have done a runtime check of the type annotations to make sure everything runs on py3.10+.

How it works:

  • pyclasses.py:
    • I've written TypedDict classes for missing ranges and MR sets.
    • Because the instance is not meant to be initialized by the user, I've set the type annotations for optional parameters as the end type. It might be better to turn it into a dataclass or add default values of the same type.
  • pyreadstat.py:
    • Created a FileLike protocol with the methods read and seek.
    • Use os.PathLike for flexibility with os.fsencode
    • Added overloads to read functions for the different output format types.
    • Write functions accept either a pandas.DataFrame or a polars.DataFrame as the first argument (originally: any dataframe object supported by narwhals).
    • Chunk- and multi-read functions only accept a PyreadstatReadFunction callable type. Its first argument must be a path/file-like object and it must return a tuple of data and metadata.
  • pyfunctions.py
    • Use narwhals type vars to signal the return of the same type of dataframe.
  • py.typed: to signal to type checkers that the public interface of the library is type hinted.
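As a rough illustration of the FileLike protocol and the output-format overloads listed above, here is a minimal sketch. It is not the PR's code: `read_toy`, its toy comma-separated parsing, and the `"rows"` format are hypothetical stand-ins, and only the two protocol methods (`read` and `seek`) come from the description.

```python
import io
import os
from typing import Literal, Protocol, Union, overload, runtime_checkable


@runtime_checkable
class FileLike(Protocol):
    """Anything exposing read() and seek(), e.g. an open binary file."""

    def read(self, size: int = -1, /) -> bytes: ...
    def seek(self, offset: int, whence: int = 0, /) -> int: ...


# Hypothetical alias name; accepts paths or file-like objects.
FilePathOrBuffer = Union[str, bytes, os.PathLike[str], FileLike]


# Each overload ties one literal output_format to one return type.
@overload
def read_toy(source: FilePathOrBuffer, output_format: Literal["dict"]) -> dict[str, list[str]]: ...
@overload
def read_toy(source: FilePathOrBuffer, output_format: Literal["rows"] = ...) -> list[list[str]]: ...
def read_toy(source, output_format="rows"):
    # Toy reader: parses comma-separated bytes, not a real statistical file.
    if isinstance(source, FileLike):
        source.seek(0)
        raw = source.read()
    else:
        with open(os.fsdecode(source), "rb") as handle:
            raw = handle.read()
    header, *rows = [line.split(",") for line in raw.decode().splitlines()]
    if output_format == "dict":
        return {name: [row[i] for row in rows] for i, name in enumerate(header)}
    return rows


buffer = io.BytesIO(b"a,b\n1,2\n3,4\n")
as_dict = read_toy(buffer, "dict")  # a type checker narrows this to dict[str, list[str]]
as_rows = read_toy(buffer)          # and this to list[list[str]]
```

With overloads like these, a type checker can pick the return type from the literal passed as `output_format`, which is the effect the PR describes for the real read functions.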

While I added type annotations, I saw a few issues with the docstrings. I took the liberty to sync them up with the type annotations.

I also used a formatter for the function signatures as they were getting unwieldy. This changed the formatting of some of the code within the functions, so let me know if you would prefer I revert those.

Looking forward to your feedback.

@ofajardo
Collaborator

hi @nachomaiz thanks a lot! I am a bit snowed under right now, but will take a look as soon as I get some time.
By looking at what you wrote here, I have two comments:

  • If you like, you can transform the metadata container to a data class if you think it will work better.
  • For the write functions, they should only accept a pandas or a polars dataframe, accepting any dataframe that narwhals would accept is misleading, because only those two are supported.

@jonathon-love please check this PR, probably it addresses the same as in yours?

@nachomaiz
Contributor Author

nachomaiz commented Feb 16, 2026

Thank you for the feedback!

I've now changed metadata_container to a dataclass. I saw that all fields are assigned to in _readstat_parser, so I've removed the Nones and added equivalent, type-compatible values. datetime fields default to datetime.now, but hopefully it should be a minimal amount of extra compute.

I also took a quick look at @jonathon-love's PR, and I realized I was missing specific types for PathLike and np.ndarray, so I ported over some of those nicer type definitions. Could I ask for your thoughts as well?

I switched the dataframe types for the write functions to pandas.DataFrame | polars.DataFrame to keep the hints restricted to those libraries only.

Additionally, I've added py.typed to MANIFEST.in and setup.py, as I got reminded by @jonathon-love to include that as well.

@ofajardo
Collaborator

hi @nachomaiz thanks for your efforts!

I have cloned your fork, compiled it (all ok) and then run the tests.

The test_basic.py and test_narwhalified.py fail with 3 errors. The origin is probably in the metadata_container class: if you look carefully, some members such as number_rows were None by default before and are now defined as 0. This creates an inconsistency when using metadataonly, which also breaks read_in_chunks when reading an export file. So, could you please review those members and adapt them to be as they were before?

oh now I see your comment

I saw that all fields are assigned to in _readstat_parser, so I've removed the Nones and added equivalent, type-compatible values. datetime fields default to datetime.now, but hopefully it should be a minimal amount of extra compute.

No, that is not correct: those values are not always set, and therefore the Nones need to be there. Also, do not default the datetime fields to datetime.now; default them to None.
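To make the distinction concrete, here is a minimal sketch of a dataclass whose members default to None rather than to a type-compatible placeholder. `ToyMetadata` and `creation_time` are hypothetical names used only for illustration; only `number_rows` is taken from the discussion.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class ToyMetadata:
    # None means "the file did not record this", which is different from 0.
    number_rows: Optional[int] = None
    # Hypothetical datetime field: defaults to None, not datetime.now().
    creation_time: Optional[datetime] = None


def describe_rows(meta: ToyMetadata) -> str:
    """Distinguish an unknown row count from a genuinely empty file."""
    if meta.number_rows is None:
        return "row count unknown"
    return f"{meta.number_rows} rows"
```

With 0 as the default, the two cases handled by `describe_rows` would be indistinguishable, which is presumably why code paths such as metadataonly reads depend on None.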

Another one: typing_test.py raises a lot of errors. I am less familiar with mypy so I have not checked what they are about.

Found 22 errors in 4 files (checked 1 source file)

I think you have to get a machine where you can compile pyreadstat and run the tests. Please run test_basic.py, test_narwhalified.py with backend==pandas and backend==polars, and test_http_integration.py. BTW, please rename typing_test.py to test_typing.py just to keep the naming pattern.

Last one:
In setup.py there are these two additions:

    package_data={"pyreadstat": ["py.typed"]},
    include_package_data=True,

I think this might be unnecessary if you included py.typed in the manifest. The issue with package data is that on Windows, when people install Python from the Windows app store, the package and the package data are installed in different places (I can't remember exactly where), and I am not sure whether the IDE will see py.typed in that case (maybe yes?). I had such an issue in the past when I had to deliver DLL files for Windows and Python was not able to find them. I think this has to be tested.

Otherwise it looks good! =) Speed is also the same as before when I converted the files from pyx to py, so it seems the dataclass change is neutral in terms of performance.

@nachomaiz
Contributor Author

Hello!

Thank you for reviewing and for your feedback!

I'm working on setting up a machine to be able to compile and run tests, will hopefully have it soon.

In the meantime, just wanted to get your thoughts on a couple of the things you mentioned above.

I have now gone through the code a bit more carefully and found the places where the num_rows distinction between 0 and None is made, and I see that it's generally related to POR and XPORT (?) files not having row counts in their metadata, and how that interacts with metadataonly=True and chunk/multiprocessing reads...

What makes it a bit complicated in my view is that if we set num_rows as int | None, any access of num_rows for any other file type will always need to be preceded by:

if meta.number_rows is not None:
    ...

Which may be the easier way to handle things in the end, but also feels redundant when for many users it would never be None. Would there be any alternatives? Maybe a subclass of metadata_container only for POR and XPORT files, in which number_rows can be None as well? But you're probably more familiar with the code in terms of other potential ways to keep the logic working.

...

On the typing_tests.pyi file, do note that it's a PYI file, so it's not executable. This is just to run mypy tests against it, with mypy tests/typing_tests.pyi. I was worried that naming it as test_typing.py would make the test runner think it's a file with actual runtime tests. I suppose since it's a PYI file I could rename it to test_typing.pyi and it should be ok.

There are a few rows which should error, there should be 5 errors in that file (there are comments in the file where it points them out). It also analyzes other files as the test file imports them, so I'm ignoring those files for the purpose of these annotations.

I noticed that there's an import error in the file so at the moment it doesn't work correctly, I'll fix that soon. I should also mention that both polars and pandas-stubs must be installed for mypy to do the type checks correctly.

I'll fix those few bits and remove the extra setup.py lines in my next batch of commits, but I'd be keen to hear your thoughts on alternatives to setting int | None for all types of files.

@ofajardo
Collaborator

hi @nachomaiz

Regarding num_rows being int or None: None signals that the information could not be recovered from the metadata and is therefore undefined. It is not correct to say that this happens only for POR and XPORT files. In theory it can happen with any file type if the writing application did not store that information. For example, some applications that write SPSS SAV files do not record the number of rows, so it cannot be determined and should stay as None (see for example #109).

However, I am not 100% sure what the problem is... this is the way it has been for years and there have been no problems so far. I am also reluctant to change the interface unless it is strictly necessary. So can you please explain a bit more what your concern is? If you mean that the user needs to check whether num_rows is None, then yes, a user who wants to be strict should do that; there is no way around it, for the reason explained above.

Please also notice that I would like all the members that were None before to stay as they are, not only num_rows.

@nachomaiz
Contributor Author

Ok! That makes sense. My mistake for assuming things. 😅

Will bring back all the None values, try to run the tests, and push new commits, aiming for later today.

Hopefully that gets it to a good place to merge!

@ofajardo
Collaborator

hi @nachomaiz thanks!

Regarding the typing tests, please indicate in the comment at the top of the file, where you indicate that it has to be run with mypy, which other packages need to be installed in order to run.

We need the tests to be executable: they should contain assertions that all pass if everything is fine and fail if something is wrong. These tests will then be run when building the wheels and are expected to pass, so reveal_type is not enough. So please turn your tests into an executable file and rename it as suggested before. I have never done this, so I am not sure which is better; a quick search says you can use either assert_type (would be nice as no extra package is needed, and you could then do something similar to test_narwhalified.py) or pytest-mypy-plugins (requires installing extra packages, but apparently makes negative tests easier to write).
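For reference, a minimal sketch of what an assert_type-based test could look like. `load_toy` and `test_dict_output` are hypothetical names, not functions from pyreadstat. Note that `typing.assert_type` was added in Python 3.11 and does nothing at runtime, so the static part only fails when the file is checked with a type checker such as mypy; the plain `assert` lines are what fail when the test is executed.

```python
import sys

if sys.version_info >= (3, 11):
    from typing import assert_type
else:
    def assert_type(val, typ):  # runtime stand-in for older interpreters
        return val


def load_toy(output_format: str = "dict") -> dict[str, list[int]]:
    """Hypothetical stand-in for a read function returning dict output."""
    return {"x": [1, 2, 3]}


def test_dict_output() -> None:
    data = load_toy()
    # Static check: a type checker reports an error if the inferred type differs.
    assert_type(data, dict[str, list[int]])
    # Runtime checks: these are what fail when the test is executed.
    assert isinstance(data, dict)
    assert all(isinstance(column, list) for column in data.values())
```

Because the static and runtime halves fail in different tools, such a file would still need to be run through both the test runner and the type checker to cover both.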

@nachomaiz
Contributor Author

Hi @ofajardo!

Thanks for making the change!

I've merged it into the PR and I'm making some of the changes we discussed. I'll finish up soon with a bit more of a write up with the changes and answering your questions about design choices.

@nachomaiz
Contributor Author

nachomaiz commented Apr 8, 2026

Hi again!

Ok, so I've updated the PR with a few changes, I'll try to list them and explain the reasoning behind them, so please excuse the long text. 😅

  • Changed DictOutput to represent dict[str, list[Any]] as we discussed in all relevant signatures and type checks.
  • Addressed the issue with flaky and long type error messages by using the regex features from pytest-mypy-plugins. These still validate that, in error tests, only the first argument to chunked read functions is the one that fails, but ignores the long and complicated types in error messages.
  • Removed parametrization of PyreadstatReadFunction (the ParamSpec declaration).
    • I was hoping to use this to narrow the types of kwargs, but the Python typing spec does not yet support parametrized functions without explicit parametrized *args, so it wasn't doing anything and I thought best to remove it for simplicity.
  • You were right to call out the imports within TYPE_CHECKING; they were wrong, and I've hopefully fixed them by defining dummy classes instead.
    • The idea was that, in order to support pandas and polars as optional backends, we should not attempt to import them at runtime, but we want to benefit from their type definitions for type checking purposes.
    • We use TYPE_CHECKING for that, which is defined as a variable that should always be True when type checkers evaluate the code, but is always False at runtime, so the code inside it never gets executed, but type checkers can see it. That way, we can use pandas and polars types in type hints. A side effect is that they must be declared as string type aliases (DataFrame: TypeAlias = "PandasDataFrame"), and we use TypeAlias to mark that the string does not represent a literal, but a "type".
      • Quick side note, I've added the TypeAlias tag to the other type aliases to signal that they are not supposed to be used as actual type objects at runtime.
    • The problem with my original approach was that, when an import fails in type checking, the types get turned to Unknown, which is an alias of Any, so all types were allowed. Declaring dummy classes inside TYPE_CHECKING means we can use them as "type sentinels" and reject any type that is not a known, successfully imported DataFrame, or the dummy class (which wouldn't normally be possible, unless someone is trying to do weird things by using the type declarations from pyreadstat while not having pandas or polars installed).
    • Finally, I've declared the DataFrame union type outside of the block as a TypeAlias, so the name exists at runtime and might make it easier when introspecting types at runtime, but it doesn't really seem to make a difference otherwise.
    • I've also added a few comments which hopefully explain the intent of each piece of code.
  • While I'm on this file, just another quick note that the FileLike protocol is based on the structure of the object that the _readstat_parser.pyx functions expect, i.e. having a read and a seek method. This means any class that behaves like a file object would be accepted. I guess we could in theory use a Buffer protocol, which most file objects would likely implement, but then the code to read from the object would need to change to deal with raw memory buffers. Probably very out of scope for this PR.
  • I've fixed a few more small issues through the pyreadstat.py file: made all overloads use ... and added mentions of polars in docstrings.
  • I've gone ahead and added an __all__ declaration to the library. It wasn't too much work anyway, and it sort of is part of having a more accessible public API. 😄
    • I noticed that read_sav was imported twice, so I removed it. I've done the same with an unused import of copy in pyfunctions.py.
    • I've also added an import for PyreadstatError to expose the full public API in the __init__ module.
  • This PR introduces several test dependencies, so I think it's a good choice to implement dependency groups. I've gone ahead and added two dependency groups: dev and test.
    • Dev contains the required dependencies to compile and run the library.
    • Test contains the required dependencies to test all functionality, as well as the typing checks.
    • The command to install them is available in pip>=25.1: pip install --group dev --group test.
    • I've added relatively modern minimum versions that should work with Python 3.11 (tested by running the install command above).
    • Given this complication and the added complexity of the test commands, I've added a few notes and commands to the how_to_test.md file which hopefully help with referencing and simplify the process overall.
    • Thanks to the command to run all tests consecutively in there, it was quite easy to check for any issues before making these commits. 😄

There are still some typing issues and inconsistencies that are unfortunately not supported, mostly to do with function signatures and overloads. I know of a few, but there probably are more edge cases out there that I'm not aware of:

  • Some type checkers (at least Pylance that I have access to) didn't like that the chunked read functions aren't parametrized, and had trouble identifying the right types. Don't think this can be solved with current tooling.
  • Any function with a file/path-like first argument, more arguments, and returning a tuple of a dict or dataframe and a metadata container might trigger a false positive and pass checks, but I don't think it's avoidable without parametrization. At least the expected function signature is pretty unique, so it might be a nice unintended feature if someone wants to write a custom read function on top of the ones here?
  • Passing output_format as part of a **kwargs dict in any of the read functions won't trigger the overloads, in some type checkers at least, so the output type won't be narrowed, but that's a current limitation of parametrization and the typing spec as well.

Hopefully this clarifies the changes and thinking behind them. Let me know if you have any other lingering questions, or if you prefer I revert any of the new changes.

Thanks for your patience and help! Hope this gets it close to merging!

@ofajardo
Collaborator

ofajardo commented Apr 9, 2026

hi @nachomaiz thanks a lot for your detailed and hard work! I think we are going to be there soon!

I am reviewing and so far everything looks excellent. But I got a few errors when running the typing tests. At least the first one seems to be related to the mypy version: I am using 1.20.0, maybe you are using an older one? If so, I would recommend updating to the newest version. I haven't checked the other errors; many seem to be the same thing, but there are a couple that may be something else. Could you please check?

_________ read_sav_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] _________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:25:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:4: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:4: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
_________ read_dta_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] _________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:52:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
_________ read_por_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] _________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:79:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
______ read_sas7bdat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] _______
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:106:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
________ read_xport_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] ________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:133:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
______ read_sas7bcat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] _______
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:160:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
_ read_file_multiprocessing_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] _
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:188:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
___ read_file_in_chunks_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] ____
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:215:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:3: note: Revealed type is "dict[str, list[Any]]" (diff)
E   Expected:
E     main:3: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E   Alignment of first line difference:
E     E: ...te: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]...
E     A: ...te: Revealed type is "dict[str, list[Any]]"...
E                                 ^
______________________________________________________ worker_types _______________________________________________________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:307:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     main:5: note: Revealed type is "pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | dict[str, list[Any]]" (diff)
E     ...
E   Expected:
E     main:5: note: Revealed type is "pandas.core.frame.DataFrame | polars.dataframe.frame.DataFrame | builtins.dict[builtins.str, builtins.list[Any]]" (diff)
E     ...
E   Alignment of first line difference:
E     E: ...ataframe.frame.DataFrame | builtins.dict[builtins.str, builtins.list[...
E     A: ...ataframe.frame.DataFrame | dict[str, list[Any]]"...
E                                      ^
________________________________________________ metadata_container_types _________________________________________________
/myhome/rancher/githubrepos/pub/pyreadstat_nachomaiz/tests/test_typing.yml:337:
E   pytest_mypy_plugins.utils.TypecheckAssertionError: Invalid output:
E   Actual:
E     ...
E     main:24: error: Missing keys ("counted_value", "is_dichotomy", "label", "type", "variable_list") for TypedDict "MRSet"  [typeddict-item] (diff)
E     ...
E   Expected:
E     ...
E     main:24: error: Missing keys ("type", "is_dichotomy", "counted_value", "label", "variable_list") for TypedDict "MRSet"  [typeddict-item] (diff)
E     ...
E   Alignment of first line difference:
E     E: ...rror: Missing keys ("type", "is_dichotomy", "counted_value", "label",...
E     A: ...rror: Missing keys ("counted_value", "is_dichotomy", "label", "type",...
E                                ^
================================================= short test summary info =================================================
FAILED tests/test_typing.yml::read_sav_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::read_dta_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::read_por_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::read_sas7bdat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::read_xport_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::read_sas7bcat_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::read_file_multiprocessing_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::read_file_in_chunks_output_types[output_format=dict,expected_type=builtins.dict[builtins.str, builtins.list[Any]]] -
FAILED tests/test_typing.yml::worker_types -
FAILED tests/test_typing.yml::metadata_container_types -
======================================== 10 failed, 40 passed in 194.87s (0:03:14) ========================================

@nachomaiz
Contributor Author

nachomaiz commented Apr 9, 2026

Huh, interesting!

So yes, I was using 1.19.1 up until now. It looks like 1.20.0 was released last week, and looking at the release notes it's not very apparent what changed that would have made those fail now...

But yeah, it seems like their rewrite of the type cache format they talk about in the notes has changed the behavior so that builtins. is not included in the types inside the error output, and it has sorted the keys of the MRSet typed dictionary alphabetically in the error output as well. But it also makes the test runs much faster now! 😄

I've fixed those issues to work with the new version, and bumped the version of mypy in the pyproject.toml file. Let me know if they work on your end too!
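As a purely illustrative aside (the PR itself updated the expected output and bumped mypy), one way to make such an expectation tolerant of both output styles is a regular expression with an optional `builtins.` prefix. The two sample strings are copied from the failure log above; the pattern name is hypothetical.

```python
import re

# Accepts the revealed type with or without the "builtins." qualifiers.
REVEALED_DICT = re.compile(
    r'Revealed type is "(?:builtins\.)?dict\[(?:builtins\.)?str, (?:builtins\.)?list\[Any\]\]"'
)

old_style = 'main:4: note: Revealed type is "builtins.dict[builtins.str, builtins.list[Any]]"'
new_style = 'main:4: note: Revealed type is "dict[str, list[Any]]"'

matches_old = REVEALED_DICT.search(old_style) is not None
matches_new = REVEALED_DICT.search(new_style) is not None
```

The trade-off is that a looser pattern hides formatting changes between mypy versions, whereas pinning the version and matching exact output keeps the tests strict.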

@ofajardo ofajardo merged commit 2ba7945 into Roche:pyfile_dev Apr 10, 2026
@ofajardo
Collaborator

hey @nachomaiz, I merged! Thanks again for all the hard work so far! What happens now is that I am going to prepare everything to send this branch to the CI/CD pipeline. If everything passes there, I will upload to TestPyPI, let you know so you can test the package, and then finally do a release. I also asked Claude to write a script to test the types at run time; I will let you know once I have put it up for you to take a look.

@nachomaiz
Contributor Author

Amazing, thank you very much! I'll keep the branch open until then and delete after all looks good.

I learned a lot about typing, cython and this library so I'm very grateful for all your help and patience.

Will check on test PyPI once you let me know!

@ofajardo
Collaborator

hi @nachomaiz big success! The CI/CD pipeline worked well! Could you maybe download the wheel manually from here and try it? It is a bit easier than uploading to PyPI ....

The new test file is here in case you would like to take a look.

The only side effect of all of this is that pyreadstat no longer works on Python 3.10 (see here). As I said before, I am ok with that. However, I was expecting that, now that we are not using numpy types, it would work on 3.10, but it actually fails when importing pyreadstat:

import pyreadstat
    File "/tmp/tmp.XaWIF9ZXL2/venv/lib/python3.10/site-packages/pyreadstat/__init__.py", line 18, in <module>
      from .pyreadstat import read_sav, read_sas7bdat, read_xport, read_dta, read_por, read_sas7bcat
    File "/tmp/tmp.XaWIF9ZXL2/venv/lib/python3.10/site-packages/pyreadstat/pyreadstat.py", line 69, in <module>
      Concatenate[FilePathorBuffer, ...], "tuple[DataFrame | DictOutput, metadata_container]"
    File "/opt/python/cp310-cp310/lib/python3.10/typing.py", line 312, in inner
      return func(*args, **kwds)
    File "/opt/python/cp310-cp310/lib/python3.10/typing.py", line 403, in __getitem__
      return self._getitem(self, parameters)
    File "/opt/python/cp310-cp310/lib/python3.10/typing.py", line 599, in Concatenate
      raise TypeError("The last parameter to Concatenate should be a "
  TypeError: The last parameter to Concatenate should be a ParamSpec variable.

As I am not familiar with all this typing stuff yet, I asked Claude what the cause is and what the solutions could be:

The error occurs at pyreadstat/pyreadstat.py:68-70 — a runtime type alias definition:

     PyreadstatReadFunction: TypeAlias = Callable[
         Concatenate[FilePathorBuffer, ...], "tuple[DataFrame | DictOutput, metadata_container]"
     ]

     The ... (Ellipsis) is used as the last argument to Concatenate. In Python 3.10, Concatenate strictly validates that its last parameter is a ParamSpec variable — ... is not, so it raises TypeError. In
     Python 3.11+, this validation was relaxed, allowing ... as a valid final argument.

     Why it's a runtime error (not just a type-checking error)

     This alias is defined at module level outside the if TYPE_CHECKING: block (lines 53-70 are unindented). It executes every time pyreadstat is imported, so Concatenate[FilePathorBuffer, ...] is evaluated
     immediately at import time — before any user code runs.

     Would it be easy to fix for 3.10?

     Yes — it's a small, localized change. There are two clean approaches:

     Option A: Move alias into TYPE_CHECKING + add from __future__ import annotations

     Move PyreadstatReadFunction inside the if TYPE_CHECKING: block and add from __future__ import annotations at the top of pyreadstat.py. This defers all annotation evaluation to strings, so
     PyreadstatReadFunction in function signatures (lines 935, 949, 963, 976, 1075, 1085, 1095, 1104) would not be evaluated at runtime. worker.py already imports it inside TYPE_CHECKING (line 24-25), so no
     change needed there.

     Option B: Simplify the type alias to avoid Concatenate

     Replace the alias with:
     PyreadstatReadFunction: TypeAlias = Callable[..., "tuple[DataFrame | DictOutput, metadata_container]"]
     This drops the Concatenate constraint (slightly less precise — doesn't enforce first arg is FilePathorBuffer) but works on all Python versions and still provides the return type information.

     Scope of impact

     - Concatenate is used in exactly 1 place in the entire codebase (line 69)
     - PyreadstatReadFunction is used in 8 function signatures in pyreadstat.py and 1 import in worker.py
     - No other typing constructs in the file have 3.10 compatibility issues

Option A looked sort of OK, so I asked what are the implications of doing that:

What from __future__ import annotations does

  It's not version-specific — it affects all Python versions (3.7+). When present, it changes how Python handles every annotation in that file: instead of evaluating them as live expressions at definition
  time, it stores them as plain strings.

  What we'd lose

  Runtime annotation introspection changes. If any code reads obj.__annotations__ on functions in this file, it would get back strings like "PyreadstatReadFunction" instead of the
  actual resolved type object, and typing.get_type_hints() would have to evaluate those strings, raising NameError for names that exist only under TYPE_CHECKING. This matters for frameworks that inspect annotations at runtime (e.g., FastAPI, Pydantic, dataclasses).

  For pyreadstat, this is likely a non-issue — the file doesn't do runtime annotation introspection. But it's a broader behavioral change than strictly necessary for this one-line fix.
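
To make the introspection difference concrete, here is a small sketch. The `read` function is hypothetical and has nothing to do with pyreadstat's readers; the module source is compiled with `exec` only so that the `__future__` import can be the first statement of its own module.

```python
import typing

# Source of a tiny hypothetical module that opts in to stringified annotations.
SOURCE = '''
from __future__ import annotations

def read(path: str, limit: int = 0) -> list[str]:
    return []
'''

namespace = {"__name__": "demo_module"}
exec(SOURCE, namespace)
read = namespace["read"]

# Raw annotations are stored as plain strings.
raw = read.__annotations__
# get_type_hints() evaluates those strings against the function's globals.
resolved = typing.get_type_hints(read)

print(raw)
print(resolved)
```

Here every name in the annotations can be resolved, so `get_type_hints()` succeeds. If a name were only imported under `if TYPE_CHECKING:`, that call would raise `NameError` instead.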

  Also worth noting: from __future__ import annotations was originally intended to become the default in Python 3.10 (PEP 563), but that was postponed indefinitely because it broke runtime uses of
  annotations in popular libraries. Python 3.14 introduced PEP 649 (deferred evaluation) as a different solution instead. So adding from __future__ import annotations is going slightly against the grain of
  where Python is heading.

  The targeted alternative

  Since the problem is only Concatenate[FilePathorBuffer, ...] on line 69, we can fix it more surgically:

  - Just wrap PyreadstatReadFunction in if TYPE_CHECKING: and quote the references to it in function signatures ("PyreadstatReadFunction" instead of bare PyreadstatReadFunction). This changes nothing about
  how the rest of the file behaves — no global annotation semantics change.

  This is more edits (8 function signatures to quote), but the blast radius is zero outside that one alias.

So, I like backwards compatibility, but the solutions look a bit odd to me, so I have not implemented anything. Do you have any thoughts on this?

@nachomaiz
Contributor Author

nachomaiz commented Apr 10, 2026

Hmm I see...

Yeah I didn't realize that would make it not work with 3.10. The original code with the ParamSpec declaration did work with that version.

The reason I wanted to use Concatenate is that the read function signatures all take a mandatory FilePathorBuffer as the first argument, and then all other arguments are specific to each function. Replacing Concatenate with a single ellipsis [...] would allow function signatures without the expected first argument. In truth, it's all just a slightly fancier permissive type hint. I think (and hope) the intent is clear with the name of the callable type, but using Concatenate would allow for a slightly more explicit type hint, where that first argument to the function must fulfill a specific expectation.

So we could revert that, which I think for consumers of the library would look pretty much the same as with the ellipsis.

If you'd rather avoid creating more PRs/branches, you could implement that in your branch by importing ParamSpec from typing and then reverting the function declaration to this, as it was in one of my earlier commits:

_P = ParamSpec("_P")
PyreadstatReadFunction: TypeAlias = Callable[
    Concatenate[FilePathorBuffer, _P], "tuple[DataFrame | DictOutput, metadata_container]"
]

Alternatively, I would actually suggest that, instead of moving the declaration inside TYPE_CHECKING, the whole type declaration could be put inside string quotes rather than just the return type (the quotes around the return type would then need to be removed). The TypeAlias definition would keep it tagged as a type expression, and we wouldn't need any changes elsewhere in the file.

It would look like this:

PyreadstatReadFunction: TypeAlias = "Callable[Concatenate[FilePathorBuffer, ...], tuple[DataFrame | DictOutput, metadata_container]]"

I believe most type checkers tend to backport these semantic changes to previous versions, even if they would fail at runtime, and expect them to be declared inside string quotes, so it should work unless I'm mistaken. We might need to check that both the runtime type checks and the static typing checks are happy with that.

@ofajardo
Collaborator

hi @nachomaiz thanks for the suggestion. So, I implemented the change in this commit. Now all tests pass with Python 3.10 and also with 3.13. The changes are minimal, but I am not sure what the implications of doing this are. Do you think it has any side effects? It would be nice to support Python 3.10, but apparently it is also sunsetting and end of support is the end of this year.

@nachomaiz
Contributor Author

nachomaiz commented Apr 13, 2026

Hi! No worries, and happy to explain a bit more.

Firstly, I think for the use cases of the library I very highly doubt it will have any real side effect. Especially since type hints will be new, so nobody should have had the chance to find creative ways to use them yet. 😄

The "stringifying" of the types is done so that Python doesn't execute the code when declaring the type.

To a type checker this works exactly as if the types were declared without the quotes, but Python just sees it as a str.

This must be done when the type expression contains undeclared types or types that are unrepresentable at runtime.

An example of the former:

class A:
    def f(self) -> "B":  # B is not declared yet, so we use "B" (not needed in Python 3.14+)
        return B()

class B:
    pass

The DataFrame declaration is an example of the latter. If we declared the type at runtime, both pandas and polars would need to be installed, since they would have to be imported. Declaring it as a string means it never gets executed, and I moved it outside of the TYPE_CHECKING block so that it still exists as a type alias at runtime.

Anything that lives inside TYPE_CHECKING also doesn't exist at runtime, so it needs to be used as string in type declarations.

And for the PyreadstatReadFunction type, this means that even though the ellipsis with Concatenate is not allowed in 3.10, the expression is never actually evaluated, so Python won't throw errors at runtime. Type checkers tend to turn annotations they can't figure out into Unknown or Any, so the failure point moves to type checking, and only on Python 3.10 (which is EOL soon) with a type checker that doesn't backport typing features. It works in every other case.

All that said, most of this is almost fully obsolete since Python 3.14, as they have essentially made the from __future__ import annotations behavior the default. See this for more info: https://peps.python.org/pep-0749/, but this translates to being able to declare type hints without the quotes in 3.14+.

I imagine string types will be supported for a while, but if that changes it would probably be after 3.14 is EOL, and the fix is to remove the quotes and use the type DataFrame = ... syntax from Python 3.12.

So to summarize, I don't think it will have any real side effect for users, and Python is moving towards type hints behaving essentially the same as this from now on.

@ofajardo
Collaborator

ofajardo commented Apr 14, 2026

hi @nachomaiz thanks a lot for the explanation! I have added some comments to revert the changes once 3.10 is EOL. Now wheels are successfully built on all platforms and also for Python 3.10.

Please give them a try installing from here:

pip install -i https://pypi.anaconda.org/ofajardo/simple pyreadstat

Can you see the annotations you are expecting in your IDE?

If everything is OK, I will do the release.

@nachomaiz
Contributor Author

No worries, it took me a while to learn it all so happy to share!

I just downloaded and tested the new version on VSCode and I can say that it's working beautifully.

Type definitions are now shown and correct, and "go to definition" functionality from the language server is working great as well!

Is there a plan for supporting Python 3.14? I defaulted to trying it on that version by reflex and pip said it wasn't supported, but it worked great on 3.13! 😄

@ofajardo
Collaborator

hmm ... python 3.14 IS supported ... so I wonder why you cannot install it ... can you maybe check directly on the anaconda.org/ofajardo/pyreadstat page ... the wheel should be there and if you download it you should be able to install it

@ofajardo
Collaborator

I tried it myself on Python 3.14 with the pip command I shared, and it failed because it could not resolve the narwhals dependency. The pip command says the index must be my anaconda.org repo, and that one does not have narwhals, so the dependency cannot be resolved. I installed narwhals manually from the regular PyPI, then tried again from my anaconda repo, and this time the installation worked.

Then I uploaded to TestPyPI and got the same thing: TestPyPI does not have narwhals either, so same issue and same solution.

I assume you had the same issue, and I feel confident that it will work on the regular PyPI, so I am going to proceed with the release.

@ofajardo
Collaborator

OK, release done! In my hands, installing from PyPI for 3.14 works without issues.

@nachomaiz thanks a lot for the great contribution and awesome collaboration!

@nachomaiz
Contributor Author

Ah it was no problem. Thank you as well for your patience while I familiarized myself with the code. I've learned a lot about typing, testing type hints, Cython and working with GitHub branches, etc. as well.

Looking forward to using the new version in my code!

@nachomaiz nachomaiz deleted the pyfile_typehints branch April 15, 2026 12:40