Skip to content

[SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering#37749

Closed
aokolnychyi wants to merge 7 commits into
apache:masterfrom
aokolnychyi:spark-40295
Closed

[SPARK-40295][SQL] Allow v2 functions with literal args in write distribution/ordering#37749
aokolnychyi wants to merge 7 commits into
apache:masterfrom
aokolnychyi:spark-40295

Conversation

@aokolnychyi
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR adapts V2ExpressionUtils to support arbitrary transforms with multiple args that are either references or literals.

Why are the changes needed?

After PR #36995, data sources can request distribution and ordering that reference v2 functions. If a data source needs a transform with multiple input args or a transform where not all args are references, Spark will throw an exception.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR adapts the test added recently in PR #36995.

@github-actions github-actions Bot added the SQL label Aug 31, 2022
@aokolnychyi
Copy link
Copy Markdown
Contributor Author

@cloud-fan @sunchao @pan3793, could you take a look?

Literal.create(l.value, l.dataType)
case arg =>
throw new AnalysisException(
s"Only references and literals are supported as transform arguments: $arg")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, transform args can be arbitrary V2 expressions but I am not sure we want to invest into building a framework for supporting those right now.

Copy link
Copy Markdown
Member

@pan3793 pan3793 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aokolnychyi, the change makes sense to me.

KeyGroupedPartitioning(expressions, partitionValues.size, Some(partitionValues))
}

def supportsExpressions(expressions: Seq[Expression]): Boolean = {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunchao @cloud-fan, I went back and forth on where to add this validation. I decided to add it here as it is a current limitation of internal Catalyst KeyGroupedPartitioning. It is fine if a data source reports a partitioning with multi-arg transforms, we just can't benefit from it right now.

Let me know what you think. I also added a test to KeyGroupedPartitioningSuite.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this expected to fail for both BucketTransform and SortedBucketTransform, which always have more than 1 child expression?

Copy link
Copy Markdown

@faucct faucct Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have managed to execute a SortMergeJoin of data sources partitioned by (SingleColumnTransform, SortedBucketTransform) benefitting from the partitioning, but I have had to transform the DataSourceV2Relation to DataSourceV2ScanRelation myself and I think it is the only way because of this check. Could it be dropped?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, this code is supposed to make sure there is only one child, but it was skipped in my execution as sorted was empty, though I don't understand what is this requirement for:
https://github.com/apache/spark/pull/35657/files#diff-715d0c2d59a4ddb8a4b5952c1b05be9f035b6d9b0d9670c70b58989dc722b252R86
Making sorted empty let me pass this check, though I need it for the join optimization. But I have managed to bring this property back on the physical level using org.apache.spark.sql.connector.read.SupportsReportOrdering and execute the SortMergeJoin without the exchanges and sorts.

@aokolnychyi aokolnychyi closed this Sep 4, 2022
@aokolnychyi aokolnychyi reopened this Sep 4, 2022
Copy link
Copy Markdown
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for one suggestion

…s/physical/partitioning.scala

Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
@sunchao sunchao closed this in 127ccc2 Sep 7, 2022
@sunchao
Copy link
Copy Markdown
Member

sunchao commented Sep 7, 2022

Merged to master. Thanks @aokolnychyi , @cloud-fan , @pan3793 !

@aokolnychyi
Copy link
Copy Markdown
Contributor Author

Thanks for reviewing, @sunchao @cloud-fan @pan3793!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants