Skip to content

Disregard SAS and SPSS format width and precision when identifying date, datetime or time variables#332

Open
belegdol wants to merge 4 commits into
Roche:devfrom
belegdol:parse_formats_with_regex
Open

Disregard SAS and SPSS format width and precision when identifying date, datetime or time variables#332
belegdol wants to merge 4 commits into
Roche:devfrom
belegdol:parse_formats_with_regex

Conversation

@belegdol
Copy link
Copy Markdown

@belegdol belegdol commented May 12, 2026

This should help retain robust identification following WizardMac/ReadStat@5c44678.

@belegdol belegdol marked this pull request as ready for review May 12, 2026 08:02
@belegdol
Copy link
Copy Markdown
Author

It remains to be clarified how extra_datetime_formats etc. is to be handled. One option would be to update the documentation and specify that for SAS the formats need to be specified without width and precision, but this would mean different handling in comparison with SPSS as well as a breaking change to the existing code, which is not ideal.
A second option would be to extract the format name from the extra formats as well. Please let me know your preference.
Please also let me know if I should apply the same change to the SPSS codepath too.

@belegdol
Copy link
Copy Markdown
Author

On a second thought, breaking existing applications is a terrible idea. As such, I went ahead and implemented a way of handling extra formats provided with width and precision gracefully.

@ofajardo
Copy link
Copy Markdown
Collaborator

Please elaborate what is that change in Readstat, what is this PR trying to achieve (what is the current issue and what is this solving) and why you need to change the current mechanism, I am in principle against changing it.

@belegdol
Copy link
Copy Markdown
Author

The change in ReadStat fixes reading format width and precision from 32-bit SAS files, such as ones created with SAS for Windows. As a result, format like DATETIME22.3, which has previously incorrectly been read as DATETIME, is now correctly read as DATETIME22.3. Since DATETIME22.3 is not an element of sas_datetime_formats, the variable will end up incorrectly imported as float64 whereas it would have been correctly imported as datetime64 before. This can lead to regressions in cases which used to rely on the incorrect behaviour.
Instead of trying to list every format with every possible length and precision combination found in the wild, this PR aims to extract the name of the format and only compare it against the list of known working date, datetime and time formats. This way no --extra_datetime-format parameter will be required if a format in a dataset only differs in length or precision.
I will look into CI failures, apologies for missing these.

@ofajardo
Copy link
Copy Markdown
Collaborator

OK, now I get it, thanks for the explanation. Yes, makes fully sense to do this change. Please:

1- Make sure all tests pass
2- Please add one simple SAS file with one of the problematic formats (such as DATETIME22.3) to test_data/basic and implement a simple test in tests/test_narwhalified.py to check that the new format is read correctly, and also in connection with your other PR, the microseconds are read as expected.
3- Yes, it would be nice to have this to SPSS as well.

Thanks!

@belegdol
Copy link
Copy Markdown
Author

Existing tests should hopefully be fixed now. As far as a new test for this PR is concerned: would you mind pulling latest ReadStat to the dev branch? I can only easily generate the test files using Windows version of SAS, and the ReadStat snapshot currently in pyreadstat tree does not read the format widths or precision.

@belegdol belegdol changed the base branch from master to dev May 12, 2026 20:12
@belegdol belegdol changed the title Disregard SAS format width and precision when identifying date, datetime or time variables Disregard SAS and SPSS format width and precision when identifying date, datetime or time variables May 12, 2026
@ofajardo
Copy link
Copy Markdown
Collaborator

latest Readstat sources are now on dev

@ofajardo
Copy link
Copy Markdown
Collaborator

Since both PRs would potentially share a single test, maybe you would like to consolidate both in one PR?

@ofajardo
Copy link
Copy Markdown
Collaborator

Interestingly after updating Readstat sources, my tests are failing for the extra date formats, so my guess is that your PR will solve that =)

@belegdol belegdol force-pushed the parse_formats_with_regex branch from bac2564 to ab84cb2 Compare May 13, 2026 08:11
@belegdol
Copy link
Copy Markdown
Author

latest Readstat sources are now on dev

Thank you!

Since both PRs would potentially share a single test, maybe you would like to consolidate both in one PR?

I would prefer not to since what is being changed is only related in a way that it affects timestamp variables.

Interestingly after updating Readstat sources, my tests are failing for the extra date formats, so my guess is that your PR will solve that =)

This is actually good - as such, I do not think this PR no longer requires a separate test. As you have just experienced, with updated ReadStat the existing files no longer work because their now completely read formats are not in the list as it exists in current dev branch.
I have rebased my PR, if the existing tests run properly now, would this be sufficient?

@ofajardo
Copy link
Copy Markdown
Collaborator

I have rebased my PR, if the existing tests run properly now, would this be sufficient?

For this PR ok, yes, but for your other PR I still want a test with the new format that has fractions of seconds, and check that those fractional seconds are read as expected (as shown in SAS), so anyway the test I requested is still needed.

@belegdol
Copy link
Copy Markdown
Author

Of course, for the second PR I will provide a test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants