[SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering by aokolnychyi · Pull Request #37749 · apache/spark

aokolnychyi · 2022-08-31T21:44:06Z

What changes were proposed in this pull request?

This PR adapts V2ExpressionUtils to support arbitrary transforms with multiple args that are either references or literals.

Why are the changes needed?

After PR #36995, data sources can request distribution and ordering that reference v2 functions. If a data source needs a transform with multiple input args or a transform where not all args are references, Spark will throw an exception.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR adapts the test added recently in PR #36995.

aokolnychyi · 2022-08-31T21:44:37Z

@cloud-fan @sunchao @pan3793, could you take a look?

aokolnychyi · 2022-08-31T22:24:30Z

+        Literal.create(l.value, l.dataType)
+      case arg =>
+        throw new AnalysisException(
+          s"Only references and literals are supported as transform arguments: $arg")


Technically, transform args can be arbitrary V2 expressions, but I am not sure we want to invest in building a framework for supporting those right now.
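To make the restriction above concrete, here is a minimal, self-contained sketch of converting transform arguments that accepts only references and literals. All types in it (`Ref`, `Lit`, `Cast`, `Transform`, `Attr`, `CatalystLit`, `FuncCall`) are hypothetical stand-ins, not Spark classes, and `IllegalArgumentException` stands in for the `AnalysisException` in the diff. It illustrates the pattern, not the code in `V2ExpressionUtils`.

```scala
object TransformArgsSketch {
  // Hypothetical stand-ins for v2 connector expressions (not Spark's API).
  sealed trait V2Expr
  final case class Ref(name: String) extends V2Expr
  final case class Lit(value: Any, dataType: String) extends V2Expr
  final case class Cast(child: V2Expr, to: String) extends V2Expr // some other expression
  final case class Transform(name: String, args: Seq[V2Expr])

  // Hypothetical stand-ins for the resolved (Catalyst-side) expressions.
  sealed trait CatalystExpr
  final case class Attr(name: String) extends CatalystExpr
  final case class CatalystLit(value: Any, dataType: String) extends CatalystExpr
  final case class FuncCall(name: String, children: Seq[CatalystExpr]) extends CatalystExpr

  // Convert one transform argument: references and literals pass, anything else is rejected.
  def convertArg(arg: V2Expr): CatalystExpr = arg match {
    case Ref(name) => Attr(name)
    case Lit(value, dataType) => CatalystLit(value, dataType)
    case other =>
      throw new IllegalArgumentException(
        s"Only references and literals are supported as transform arguments: $other")
  }

  // Convert a transform with any number of arguments, in order.
  def convert(t: Transform): FuncCall = FuncCall(t.name, t.args.map(convertArg))
}
```

Under these assumptions, `convert(Transform("bucket", Seq(Lit(4, "int"), Ref("id"))))` succeeds, while a transform whose argument is a `Cast` is rejected with the error message shown in the diff.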

pan3793

Thanks @aokolnychyi, the change makes sense to me.

aokolnychyi · 2022-09-03T00:47:51Z

    KeyGroupedPartitioning(expressions, partitionValues.size, Some(partitionValues))
  }
+
+  def supportsExpressions(expressions: Seq[Expression]): Boolean = {


@sunchao @cloud-fan, I went back and forth on where to add this validation. I decided to add it here because it is a current limitation of the internal Catalyst KeyGroupedPartitioning. It is fine if a data source reports a partitioning with multi-arg transforms; we just can't benefit from it right now.

Let me know what you think. I also added a test to KeyGroupedPartitioningSuite.
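As a rough illustration of the kind of validation discussed here, the sketch below assumes the check accepts plain column references and transforms with exactly one reference argument, and rejects everything else. That rule is an assumption inferred from this thread, not the merged implementation, and the types (`Attr`, `Lit`, `TransformExpr`) are hypothetical stand-ins rather than Catalyst classes.

```scala
object KeyGroupedCheckSketch {
  // Hypothetical stand-ins for Catalyst expressions.
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Lit(value: Any) extends Expr
  final case class TransformExpr(name: String, children: Seq[Expr]) extends Expr

  // Assumed rule: a transform is usable only if it wraps exactly one column reference.
  private def isSupportedTransform(t: TransformExpr): Boolean = t.children match {
    case Seq(_: Attr) => true
    case _ => false
  }

  // Every partitioning expression must be a reference or a supported transform.
  def supportsExpressions(expressions: Seq[Expr]): Boolean = expressions.forall {
    case _: Attr => true
    case t: TransformExpr => isSupportedTransform(t)
    case _ => false
  }
}
```

With this rule, a partitioning such as `Seq(TransformExpr("years", Seq(Attr("ts"))))` passes, while one containing `TransformExpr("bucket", Seq(Lit(4), Attr("id")))` does not; the idea is that an unsupported partitioning is simply not used rather than treated as an error.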

Is this expected to fail for both BucketTransform and SortedBucketTransform, which always have more than 1 child expression?

I have managed to execute a SortMergeJoin of data sources partitioned by (SingleColumnTransform, SortedBucketTransform) that benefits from the partitioning, but I had to transform the DataSourceV2Relation into a DataSourceV2ScanRelation myself, and I think that is the only way because of this check. Could it be dropped?

I guess this code is supposed to make sure there is only one child, but it was skipped in my execution because sorted was empty. I don't understand what this requirement is for:
https://github.com/apache/spark/pull/35657/files#diff-715d0c2d59a4ddb8a4b5952c1b05be9f035b6d9b0d9670c70b58989dc722b252R86
Making sorted empty let me pass this check, even though I need the sort order for the join optimization. I managed to bring this property back at the physical level using org.apache.spark.sql.connector.read.SupportsReportOrdering and to execute the SortMergeJoin without the exchanges and sorts.

sunchao

LGTM

cloud-fan

LGTM except for one suggestion

sunchao · 2022-09-07T16:16:43Z

Merged to master. Thanks @aokolnychyi, @cloud-fan, @pan3793!

aokolnychyi · 2022-09-08T09:31:32Z

Thanks for reviewing, @sunchao @cloud-fan @pan3793!

[SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering

df7a52b

github-actions Bot added the SQL label Aug 31, 2022

aokolnychyi commented Aug 31, 2022

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala

aokolnychyi commented Aug 31, 2022

cloud-fan reviewed Sep 1, 2022

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/V2ExpressionUtils.scala Outdated

aokolnychyi added 2 commits August 31, 2022 21:47

Support nested transforms

a4e0c2d

Revert unnecessary change

f4b0066

cloud-fan approved these changes Sep 1, 2022

pan3793 approved these changes Sep 1, 2022

Add validation

5d31995

aokolnychyi commented Sep 3, 2022

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala Outdated

aokolnychyi closed this Sep 4, 2022

aokolnychyi reopened this Sep 4, 2022

Switch to NamedExpression

79d2a50

cloud-fan reviewed Sep 6, 2022

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala Outdated

Improve the check

0e43c19

sunchao approved these changes Sep 6, 2022

cloud-fan reviewed Sep 7, 2022

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala Outdated

cloud-fan approved these changes Sep 7, 2022

Update sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>

85d0e46

sunchao closed this in 127ccc2 Sep 7, 2022

szehon-ho mentioned this pull request Dec 17, 2024

[SPARK-50593][SQL] SPJ: Support truncate transform #49211

Closed

faucct Dec 18, 2023

5 participants
