Implement window functions with partition_by clause #558
Codecov Report
```
@@            Coverage Diff             @@
##           master     #558      +/-   ##
==========================================
- Coverage   76.12%   76.08%   -0.04%
==========================================
  Files         156      156
  Lines       27074    27121      +47
==========================================
+ Hits        20609    20635      +26
- Misses       6465     6486      +21
```
Continue to review full report at Codecov.
@Dandandan and @alamb this is ready now
After this pull request I'll rebase and merge #564 so that we can have a benchmark for future iterations.
```rust
    new_null_array(value.data_type(), num_rows)
} else {
    let value = ScalarValue::try_from_array(value, index)?;
    value.to_array_of_size(num_rows)
```
The same issue as for normal aggregations probably happens here: if the partition by creates a lot of groups, we will create many individual Arrow arrays (which is slow / memory consuming).
In the long run it would probably be better to store the offsets into the values in one contiguous array, store the values contiguously as well, and extend / update them in place.
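The contiguous layout being suggested can be sketched with plain `Vec`s standing in for Arrow buffers (`append_group` is a hypothetical helper, not a DataFusion API); it mirrors the offsets-plus-values layout Arrow itself uses for list arrays:

```rust
// Hypothetical sketch: instead of materializing one small array per group,
// append every group's values into one shared buffer and record where each
// group ends in an offsets vector.
fn append_group(values: &mut Vec<i64>, offsets: &mut Vec<usize>, group: &[i64]) {
    values.extend_from_slice(group); // extend the shared buffer in place
    offsets.push(values.len());      // one offset per finished group
}

fn main() {
    // offsets[0] = 0 marks the start of the first group
    let mut values = Vec::new();
    let mut offsets = vec![0usize];
    append_group(&mut values, &mut offsets, &[1, 2]);
    append_group(&mut values, &mut offsets, &[3]);
    append_group(&mut values, &mut offsets, &[4, 5, 6]);
    // group i occupies values[offsets[i]..offsets[i + 1]]
    assert_eq!(values, vec![1, 2, 3, 4, 5, 6]);
    assert_eq!(offsets, vec![0, 2, 3, 6]);
}
```

With this layout there is a single growing allocation regardless of how many groups the partition by produces.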
Not needed for this PR btw, but just noting there are similar needs/performance issues in both aggregation and window functions.
I agree; not in this pull request, but I believe this could warrant a dedicated compute kernel in Arrow for batched array slice transformation and then concatenation.
@Dandandan as a first step down that road, I've started to work on this:
```diff
-        let results = partition_points
-            .iter()
-            .map(|partition_range| {
-                let sort_partition_points =
-                    find_ranges_in_range(partition_range, &sort_partition_points);
-                let mut window_accumulators = self.create_accumulator()?;
-                sort_partition_points
-                    .iter()
-                    .map(|range| window_accumulators.scan_peers(&values, range))
-                    .collect::<Result<Vec<_>>>()
-            })
-            .collect::<Result<Vec<Vec<ArrayRef>>>>()?
-            .into_iter()
-            .flatten()
-            .collect::<Vec<ArrayRef>>();
-        let results = results.iter().map(|i| i.as_ref()).collect::<Vec<_>>();
-        concat(&results).map_err(DataFusionError::ArrowError)
+        let mut result = Vec::with_capacity(num_rows);
+        for partition_range in partition_points {
+            let sort_partition_points =
+                find_ranges_in_range(&partition_range, &sort_partition_points);
+            let mut window_accumulators = self.create_accumulator()?;
+            for range in sort_partition_points {
+                result.extend(window_accumulators.scan_peers(&values, range)?);
+            }
+        }
+        ScalarValue::iter_to_array(result.into_iter())
```
Yes - that should probably already be quite an improvement 👍
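The shape of that change can be illustrated with plain `Vec`s standing in for `ArrayRef`/`ScalarValue` (all names here are simplified stand-ins, not the actual DataFusion types): the old path built one small array per peer group and concatenated them, while the new path extends a single buffer and materializes one array at the end.

```rust
use std::ops::Range;

// e.g. a running SUM over a peer group: every row in the group
// gets the same aggregate value
fn scan_peers(values: &[i64], range: Range<usize>) -> Vec<i64> {
    let sum: i64 = values[range.clone()].iter().sum();
    vec![sum; range.len()]
}

// Old path: one allocation per range, then a concat over all of them.
fn old_path(values: &[i64], ranges: &[Range<usize>]) -> Vec<i64> {
    let parts: Vec<Vec<i64>> = ranges
        .iter()
        .map(|r| scan_peers(values, r.clone()))
        .collect();
    parts.into_iter().flatten().collect()
}

// New path: a single buffer extended in place, one final materialization.
fn new_path(values: &[i64], ranges: &[Range<usize>]) -> Vec<i64> {
    let mut result = Vec::with_capacity(values.len());
    for r in ranges {
        result.extend(scan_peers(values, r.clone()));
    }
    result
}

fn main() {
    let values = [1, 2, 3, 4];
    let ranges = [0..2, 2..4];
    // both paths produce the same rows; only the allocation pattern differs
    assert_eq!(old_path(&values, &ranges), new_path(&values, &ranges));
    assert_eq!(new_path(&values, &ranges), vec![3, 3, 7, 7]);
}
```

The behavior is identical; the win is that the number of intermediate allocations no longer scales with the number of peer groups.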
@Dandandan this is fixed now
Dandandan left a comment:
Looks great again! Two comments about tests, to be a bit more future-proof.
Fixed. Regarding repartition, I'll handle that in #569, but so far I'm seeing performance regressions.
Thanks @jimexist
Which issue does this PR close?
Closes #299
Rationale for this change
With `order by` support already implemented, we can now add `partition by` support.
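To illustrate what `partition by` requires internally, a minimal sketch (plain Rust, hypothetical `partition_ranges` helper, not the actual DataFusion code): given a batch already sorted on the partition column, find the contiguous row ranges sharing the same key, so the window accumulator can be reset per range.

```rust
use std::ops::Range;

// Given partition keys sorted so equal keys are adjacent, return the
// contiguous row range of each partition.
fn partition_ranges(keys: &[&str]) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for i in 1..=keys.len() {
        // close the current partition at the end of input or on a key change
        if i == keys.len() || keys[i] != keys[start] {
            ranges.push(start..i);
            start = i;
        }
    }
    ranges
}

fn main() {
    // sorted partition keys for 5 rows
    let keys = ["a", "a", "b", "b", "b"];
    assert_eq!(partition_ranges(&keys), vec![0..2, 2..5]);
}
```

Within each returned range, the existing `order by` machinery then determines the peer groups over which the window function is evaluated.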
What changes are included in this PR?
Are there any user-facing changes?