Skip to content

[SPARK-50593][SQL] SPJ: Support truncate transform#49211

Closed
szehon-ho wants to merge 2 commits into
apache:masterfrom
szehon-ho:SPARK-50593
Closed

[SPARK-50593][SQL] SPJ: Support truncate transform#49211
szehon-ho wants to merge 2 commits into
apache:masterfrom
szehon-ho:SPARK-50593

Conversation

@szehon-ho
Copy link
Copy Markdown
Member

@szehon-ho szehon-ho commented Dec 17, 2024

What changes were proposed in this pull request?

Truncate(col, len) partition transform is not supported in SPJ.

It seems for transforms with literal args, support was added for write distribution and ordering in #37749 but somehow storage-partition join does not support it despite the comment in : https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/test/java/org/apache/iceberg/spark/sql/TestStoragePartitionedJoins.java#L129

This change extends SPJ support to transforms where there is a singular column reference attribute and other literal attributes. Note, it seems bucket (another instance of this) is supported in SPJ via another mechanism, by having an explicit BucketTransform, but not sure it is necessary here.

Why are the changes needed?

A join between tables partitioned by truncate transform, on the partition column, do not benefit from skipping shuffle.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit test in KeyGroupedPartitioningSuite

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions Bot added the SQL label Dec 17, 2024
@szehon-ho
Copy link
Copy Markdown
Member Author

szehon-ho commented Dec 17, 2024

Putting this PR up for discussion about this problem, as I may lack the context. This PR has one way to support truncate and any future transform in this category (single attribute reference but multi-argument), but wondering if its overkill or is there any better way.

Or do we need to add an explicit TruncateTransform, and support it like BucketTransform?

@sunchao @aokolnychyi

Copy link
Copy Markdown
Member Author

@szehon-ho szehon-ho Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to fix the 'non-clustered distribution: V2 function with multiple args' test below, which now needs to compare the actual partitionRow by values (and not just reference).

@szehon-ho
Copy link
Copy Markdown
Member Author

Note : This unfortunately doesn't work to identify the negative case (not making SPJ when truncate has different arguments), without breaking some existing test in EnsureRequirementSuite.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 4, 2025

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant