[SPARK-56599][SQL] Add scan narrowing for column-level UPDATEs in DSv2 #55518
base: master
Changes from all commits
4df6279
d4babf7
f33797f
e3c253e
4060cbf
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -105,4 +105,48 @@ default String description() { | |
| default NamedReference[] requiredMetadataAttributes() { | ||
| return new NamedReference[0]; | ||
| } | ||
|
|
||
| /** | ||
| * Controls whether to send only the required data columns to the connector rather than the | ||
| * full row. | ||
| * <p> | ||
| * When true, Spark narrows the data column schema ({@link LogicalWriteInfo#schema()}) to only | ||
| * the columns declared via {@link #requiredDataAttributes()}. Metadata columns (from | ||
| * {@link #requiredMetadataAttributes()}) and row ID columns (from | ||
| * {@link SupportsDelta#rowId()}) are unaffected and always projected separately. | ||
| * <p> | ||
| * If {@link #requiredDataAttributes()} returns a non-empty array, the write schema is exactly | ||
| * those columns in declared order. The connector must include all columns it wants to receive, | ||
| * including the columns being updated. If {@link #requiredDataAttributes()} returns an empty | ||
| * array, Spark sends only the non-identity assigned columns (heuristic path). | ||
| * <p> | ||
| * Currently only consulted for UPDATE operations. | ||
| * | ||
| * @since 4.3.0 | ||
| */ | ||
| default boolean supportsColumnUpdates() { | ||
| return false; | ||
| } | ||
|
|
||
| /** | ||
| * Returns data column references required to perform this row-level operation. | ||
| * <p> | ||
| * This method is only consulted by Spark when {@link #supportsColumnUpdates()} returns | ||
| * {@code true}. If {@code supportsColumnUpdates()} returns {@code false}, the returned array | ||
| * is ignored and the full table row is sent (the default behavior). | ||
| * <p> | ||
| * When non-empty, the returned columns become the write schema in declared order. | ||
| * The connector must declare all columns it wants to receive, including the columns being | ||
|
Member
This is a very strong assumption, but it seems that this PR doesn't have any protection for it. May I ask if we have some kind of assertion or test coverage for this?
Author
Each column the connector returns passes through I added a test
||
| * updated. Use {@link RowLevelOperationInfo#updatedColumns()} to learn which columns are being | ||
| * assigned, then add any extra columns needed for row lookup or routing (e.g., primary key). | ||
| * <p> | ||
| * When empty (the default), Spark falls back to sending only the non-identity assigned columns. | ||
| * <p> | ||
| * Currently only consulted for UPDATE operations. | ||
| * | ||
| * @since 4.3.0 | ||
| */ | ||
| default NamedReference[] requiredDataAttributes() { | ||
|
Member
Given the scope of this PR, shall we mention that DELETE and MERGE ignore this method?
Author
Even though the scope of this PR is UPDATE only, we'd like this API to work for MERGE as well (DELETE doesn't benefit since it doesn't write data columns). I'm still assessing what it takes and will add a section in the SPIP on how it could be implemented. Happy to add a "currently only consulted for UPDATE" note in the Javadoc for now and remove it when MERGE support lands.
||
| return new NamedReference[0]; | ||
| } | ||
| } | ||
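To make the contract in the two Javadoc blocks above easier to follow, here is a minimal standalone sketch of the three cases they describe: narrowing disabled, an explicit `requiredDataAttributes()` list, and the empty-array heuristic path. It is not the code in this PR and does not use Spark classes. The class `ScanNarrowingSketch`, the method `narrowWriteSchema`, and the use of plain column-name strings are all hypothetical. Rejecting unknown column names and emitting heuristic-path columns in table order are assumptions of the sketch, not behavior confirmed by the diff.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: models the documented contract with plain strings.
public class ScanNarrowingSketch {

  /**
   * Computes the data columns that would be sent to the connector for an UPDATE.
   *
   * @param supportsColumnUpdates  stands in for supportsColumnUpdates()
   * @param requiredDataAttributes stands in for requiredDataAttributes()
   * @param tableColumns           all data columns of the table, in table order
   * @param assignedColumns        the non-identity assigned columns of the UPDATE
   */
  static List<String> narrowWriteSchema(
      boolean supportsColumnUpdates,
      List<String> requiredDataAttributes,
      List<String> tableColumns,
      List<String> assignedColumns) {
    // Default behavior: the declared attributes are ignored and the full row is sent.
    if (!supportsColumnUpdates) {
      return new ArrayList<>(tableColumns);
    }

    // Explicit path: the write schema is exactly the declared columns, in declared order.
    if (!requiredDataAttributes.isEmpty()) {
      for (String column : requiredDataAttributes) {
        // Assumption of this sketch: a name that is not a table column is an error.
        if (!tableColumns.contains(column)) {
          throw new IllegalArgumentException("Unknown data column: " + column);
        }
      }
      return new ArrayList<>(requiredDataAttributes);
    }

    // Heuristic path: only the non-identity assigned columns are sent.
    // Assumption of this sketch: they are emitted in table order.
    List<String> narrowed = new ArrayList<>();
    for (String column : tableColumns) {
      if (assignedColumns.contains(column)) {
        narrowed.add(column);
      }
    }
    return narrowed;
  }

  public static void main(String[] args) {
    List<String> table = List.of("id", "name", "price", "region");
    List<String> assigned = List.of("price");

    // Narrowing disabled: full row.
    System.out.println(narrowWriteSchema(false, List.of("id"), table, assigned));
    // Connector declares the updated column plus a lookup key.
    System.out.println(narrowWriteSchema(true, List.of("id", "price"), table, assigned));
    // Connector declares nothing: heuristic path.
    System.out.println(narrowWriteSchema(true, List.of(), table, assigned));
  }
}
```

The second call in `main` shows why the Javadoc asks the connector to list the updated columns itself: on the explicit path nothing is added implicitly, so a connector that declared only `id` would not receive `price`.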