fix: Correct null_count in describe()#10260
Conversation
datafusion/expr/src/expr_fn.rs
Outdated
| )) | ||
| } | ||
|
|
||
| /// Create an expression to represent the count_null() aggregate function |
There was a problem hiding this comment.
I think this function is creating a function like COUNT(IS NULL x) rather than a count_null function
datafusion/core/src/dataframe/mod.rs
Outdated
| original_schema_fields | ||
| .clone() | ||
| .map(|f| count(is_null(col(f.name()))).alias(f.name())) | ||
| .map(|f| count_null(col(f.name())).alias(f.name())) |
There was a problem hiding this comment.
Instead of a count_null() new function, what do you think about using the existing count function?
Something like count(expr.is_null())
| .map(|f| count_null(col(f.name())).alias(f.name())) | |
| .map(|f| count(col(f.name()).is_null()).alias(f.name()))) |
There was a problem hiding this comment.
When using count, it sums all non-null Expr while is_null converts Expr to a boolean expression, which is why it doesn't work.
There was a problem hiding this comment.
Got it. Make sense to me. Another classic trick is to use a CASE expression, something like
SELECT SUM(CASE WHEN x IS NULL THEN 1 ELSE 0 END) as null_count ...| "+------------+-------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----------------+------------+-------------------------+--------------------+-------------------+", | ||
| "| count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |", | ||
| "| null_count | 7300.0 | 7300 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300.0 | 7300 | 7300 | 7300 | 7300.0 | 7300.0 |", | ||
| "| null_count | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0.0 | 0.0 |", |
a3f58c4 to
3a22207
Compare
alamb
left a comment
There was a problem hiding this comment.
Thanks @Weijun-H - I think this PR needs a test for data that has actual nulls in it
I checked the file the test is using and I don't think it has any null values in it
> select * from './parquet-testing/data/alltypes_tiny_pages.parquet';
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
| id | bool_col | tinyint_col | smallint_col | int_col | bigint_col | float_col | double_col | date_string_col | string_col | timestamp_col | year | month |
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
| 122 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/13/09 | 2 | 2009-01-13T01:02:05.410 | 2009 | 1 |
| 123 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/13/09 | 3 | 2009-01-13T01:03:05.430 | 2009 | 1 |
| 124 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/13/09 | 4 | 2009-01-13T01:04:05.460 | 2009 | 1 |
| 125 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/13/09 | 5 | 2009-01-13T01:05:05.500 | 2009 | 1 |
| 126 | true | 6 | 6 | 6 | 60 | 6.6 | 60.599999999999994 | 01/13/09 | 6 | 2009-01-13T01:06:05.550 | 2009 | 1 |
| 127 | false | 7 | 7 | 7 | 70 | 7.7 | 70.7 | 01/13/09 | 7 | 2009-01-13T01:07:05.610 | 2009 | 1 |
| 128 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/13/09 | 8 | 2009-01-13T01:08:05.680 | 2009 | 1 |
| 129 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/13/09 | 9 | 2009-01-13T01:09:05.760 | 2009 | 1 |
| 130 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/14/09 | 0 | 2009-01-14T01:10:05.850 | 2009 | 1 |
| 131 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/14/09 | 1 | 2009-01-14T01:11:05.850 | 2009 | 1 |
| 132 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/14/09 | 2 | 2009-01-14T01:12:05.860 | 2009 | 1 |
| 133 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/14/09 | 3 | 2009-01-14T01:13:05.880 | 2009 | 1 |
| 134 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/14/09 | 4 | 2009-01-14T01:14:05.910 | 2009 | 1 |
| 135 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/14/09 | 5 | 2009-01-14T01:15:05.950 | 2009 | 1 |
| 136 | true | 6 | 6 | 6 | 60 | 6.6 | 60.599999999999994 | 01/14/09 | 6 | 2009-01-14T01:16:06 | 2009 | 1 |
| 137 | false | 7 | 7 | 7 | 70 | 7.7 | 70.7 | 01/14/09 | 7 | 2009-01-14T01:17:06.600 | 2009 | 1 |
| 138 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/14/09 | 8 | 2009-01-14T01:18:06.130 | 2009 | 1 |
| 139 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/14/09 | 9 | 2009-01-14T01:19:06.210 | 2009 | 1 |
| 140 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/15/09 | 0 | 2009-01-15T01:20:06.300 | 2009 | 1 |
| 141 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/15/09 | 1 | 2009-01-15T01:21:06.300 | 2009 | 1 |
| 142 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/15/09 | 2 | 2009-01-15T01:22:06.310 | 2009 | 1 |
| 143 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/15/09 | 3 | 2009-01-15T01:23:06.330 | 2009 | 1 |
| 144 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/15/09 | 4 | 2009-01-15T01:24:06.360 | 2009 | 1 |
| 145 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/15/09 | 5 | 2009-01-15T01:25:06.400 | 2009 | 1 |
| 98 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/10/09 | 8 | 2009-01-10T00:38:04.330 | 2009 | 1 |
| 99 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/10/09 | 9 | 2009-01-10T00:39:04.410 | 2009 | 1 |
| 100 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/11/09 | 0 | 2009-01-11T00:40:04.500 | 2009 | 1 |
| 101 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/11/09 | 1 | 2009-01-11T00:41:04.500 | 2009 | 1 |
| 102 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/11/09 | 2 | 2009-01-11T00:42:04.510 | 2009 | 1 |
| 103 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/11/09 | 3 | 2009-01-11T00:43:04.530 | 2009 | 1 |
| 104 | true | 4 | 4 | 4 | 40 | 4.4 | 40.4 | 01/11/09 | 4 | 2009-01-11T00:44:04.560 | 2009 | 1 |
| 105 | false | 5 | 5 | 5 | 50 | 5.5 | 50.5 | 01/11/09 | 5 | 2009-01-11T00:45:04.600 | 2009 | 1 |
| 106 | true | 6 | 6 | 6 | 60 | 6.6 | 60.599999999999994 | 01/11/09 | 6 | 2009-01-11T00:46:04.650 | 2009 | 1 |
| 107 | false | 7 | 7 | 7 | 70 | 7.7 | 70.7 | 01/11/09 | 7 | 2009-01-11T00:47:04.710 | 2009 | 1 |
| 108 | true | 8 | 8 | 8 | 80 | 8.8 | 80.8 | 01/11/09 | 8 | 2009-01-11T00:48:04.780 | 2009 | 1 |
| 109 | false | 9 | 9 | 9 | 90 | 9.9 | 90.89999999999999 | 01/11/09 | 9 | 2009-01-11T00:49:04.860 | 2009 | 1 |
| 110 | true | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 01/12/09 | 0 | 2009-01-12T00:50:04.950 | 2009 | 1 |
| 111 | false | 1 | 1 | 1 | 10 | 1.1 | 10.1 | 01/12/09 | 1 | 2009-01-12T00:51:04.950 | 2009 | 1 |
| 112 | true | 2 | 2 | 2 | 20 | 2.2 | 20.2 | 01/12/09 | 2 | 2009-01-12T00:52:04.960 | 2009 | 1 |
| 113 | false | 3 | 3 | 3 | 30 | 3.3 | 30.299999999999997 | 01/12/09 | 3 | 2009-01-12T00:53:04.980 | 2009 | 1 |
| . |
| . |
| . |
+-----+----------+-------------+--------------+---------+------------+-----------+--------------------+-----------------+------------+-------------------------+------+-------+
7300 row(s) fetched. (First 40 displayed. Use --maxrows to adjust)
Elapsed 0.015 seconds.
datafusion/core/src/dataframe/mod.rs
Outdated
| original_schema_fields | ||
| .clone() | ||
| .map(|f| count(is_null(col(f.name()))).alias(f.name())) | ||
| .map(|f| count_null(col(f.name())).alias(f.name())) |
There was a problem hiding this comment.
Got it. Make sense to me. Another classic trick is to use a CASE expression, something like
SELECT SUM(CASE WHEN x IS NULL THEN 1 ELSE 0 END) as null_count ...
datafusion/expr/src/expr_fn.rs
Outdated
| )) | ||
| } | ||
|
|
||
| /// Create an expression to represent the COUNT(IS NULL x) aggregate function |
There was a problem hiding this comment.
I found this confusing as first because it isn't COUNT(x IS NULL) -- I think a more accurate description is COUNT(x) FILTER (WHERE x IS NULL) or something like that
| /// Create an expression to represent the COUNT(IS NULL x) aggregate function | |
| /// Create an aggregate that returns the number `NULL` values in a column |
| "| describe | a | b |", | ||
| "+------------+------+------+", | ||
| "| count | 1 | 0 |", | ||
| "| null_count | 0 | 1 |", |
`null_count` was fixed upstream. Ref apache/datafusion#10260
* chore: upgrade datafusion Deps Ref #690 * update concat and concat_ws to use datafusion_functions Moved in apache/datafusion#10089 * feat: upgrade functions.rs Upstream is continuing it's migration to UDFs. Ref apache/datafusion#10098 Ref apache/datafusion#10372 * fix ScalarUDF import * feat: remove deprecated suppors_filter_pushdown and impl supports_filters_pushdown Deprecated function removed in apache/datafusion#9923 * use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option` * remove ScalarFunction wrappers These relied on upstream BuiltinScalarFunction, which are now removed. Ref apache/datafusion#10098 * update dataframe `test_describe` `null_count` was fixed upstream. Ref apache/datafusion#10260 * remove PyDFField and related methods DFField was removed upstream. Ref: apache/datafusion#9595 * bump `datafusion-python` package version to 38.0.0 * re-implement `PyExpr::column_name` The previous implementation relied on `DFField` which was removed upstream. Ref: apache/datafusion#9595
Which issue does this PR close?
Closes #10231
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?