Add doc for the statistics_from_parquet_meta_calc method #15330
xudong963 merged 1 commit into apache:main
/// For columns without statistics,
/// - For min/max, there are two situations:
///   1. The column isn't in arrow schema, then min/max values are set to Precision::Absent
///   2. The column is in arrow schema, but not in parquet schema due to schema evolution, min/max values are set to Precision::Exact(null)
In fact, I have a question about this behavior: shouldn't it be Precision::Absent?
I think in this case the default schema adapter will fill in the constant value null for all columns like this, so Precision::Exact(null) is correct.
However, as @adriangb found in #15263 and elsewhere, when users use custom schema adapters a value other than NULL is filled in.
Maybe this is another place where the schema adapter could/should be used 🤔
That makes sense. I'll try to surface the potential bug. Thanks @alamb
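To make the concern in this thread concrete, here is a minimal, self-contained sketch. It does not use DataFusion's real types: `Precision`, `Value`, `fill_missing_column`, and `stats_match_data` are simplified stand-ins invented for illustration, and the fill value `0` is a hypothetical custom default. Under those assumptions, it shows that reporting `Precision::Exact(null)` as both min and max agrees with the data only when the missing column is filled with nulls.

```rust
/// Simplified stand-in for DataFusion's `Precision`
/// (the real enum also has an `Inexact` variant).
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Absent,
}

/// Toy scalar type: `None` models SQL NULL.
type Value = Option<i64>;

/// Hypothetical adapter behavior: materialize a column that is missing
/// from the file by repeating one fill value for every row.
fn fill_missing_column(fill: Value, num_rows: usize) -> Vec<Value> {
    vec![fill; num_rows]
}

/// Checks whether a claimed exact min/max (both equal to one value)
/// agrees with the values that were actually produced.
fn stats_match_data(claimed_min_max: &Precision<Value>, data: &[Value]) -> bool {
    match claimed_min_max {
        // Exact min == max == v means every value must equal v.
        Precision::Exact(v) => data.iter().all(|d| d == v),
        // Absent statistics make no claim, so they cannot be wrong.
        Precision::Absent => true,
    }
}

/// Compares the same `Exact(null)` claim against a null fill
/// and against a hypothetical non-null fill.
fn demo() -> (bool, bool) {
    let claimed = Precision::Exact(None);
    let default_fill = fill_missing_column(None, 3);
    let custom_fill = fill_missing_column(Some(0), 3);
    (
        stats_match_data(&claimed, &default_fill),
        stats_match_data(&claimed, &custom_fill),
    )
}
```

Calling `demo()` returns `(true, false)`: the claim holds for the null fill and fails for the non-null fill, which is the mismatch the thread suspects could occur with custom schema adapters.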
alamb left a comment
Thank you very much @xudong963 -- I think this makes the code better and describes how the code works in a way that matches my understanding.
Your question is a good one -- I think it may be pointing at a potential bug here when using custom schema adapters. Nice!
/// The statistics are calculated for each column in the table schema
/// using the row group statistics in the parquet metadata.
///
/// # Key behaviors:
/// For columns without statistics,
/// - For min/max, there are two situations:
///   1. The column isn't in arrow schema, then min/max values are set to Precision::Absent
///   2. The column is in arrow schema, but not in parquet schema due to schema evolution, min/max values are set to Precision::Exact(null)
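One way to read the two documented situations is as a simple mapping from "why the column has no statistics" to the reported min/max. The sketch below is an illustration only, not the code in this PR: `Precision` is a simplified stand-in for DataFusion's type, and `MissingStatsCase` and `min_max_without_stats` are hypothetical names.

```rust
/// Simplified stand-in for DataFusion's `Precision`
/// (the real enum also has an `Inexact` variant).
#[derive(Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Absent,
}

/// Hypothetical classification of a column that has no statistics.
enum MissingStatsCase {
    /// Situation 1: the column isn't in the arrow schema.
    NotInArrowSchema,
    /// Situation 2: the column is in the arrow schema,
    /// but not in the parquet schema.
    NotInParquetSchema,
}

/// Min/max reported for a column without statistics.
/// `None` models a null scalar value.
fn min_max_without_stats(case: MissingStatsCase) -> Precision<Option<i64>> {
    match case {
        MissingStatsCase::NotInArrowSchema => Precision::Absent,
        MissingStatsCase::NotInParquetSchema => Precision::Exact(None),
    }
}
```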
Which issue does this PR close?
any instead of for_each #15289
Rationale for this change
I'm refactoring the method statistics_from_parquet_meta_calc in #15289. To make sure we're on the same page, I think it's better to document the method in detail.
What changes are included in this PR?
Add doc
Are these changes tested?
Are there any user-facing changes?