[SPARK-56599][SQL] Add scan narrowing for column-level UPDATEs in DSv2 #55518

Open
anuragmantri wants to merge 5 commits into apache:master from anuragmantri:dsv2-required-data-attrs

@anuragmantri

What changes were proposed in this pull request?

For SPIP: SPARK-56599

This PR adds three new default methods to the DSv2 connector API to enable scan and write-schema narrowing for column-level UPDATEs:

  • updatedColumns() on RowLevelOperationInfo — Spark informs the connector which columns are being assigned (non-identity only) before the operation is
    built.
  • requiredDataAttributes() on RowLevelOperation — the connector declares the exact set of data columns it needs in the write schema, symmetric with
    requiredMetadataAttributes().
  • supportsColumnUpdates() on RowLevelOperation — explicit opt-in for receiving a partial row instead of the full table row.

When a connector opts in, Spark removes identity assignments from the write plan's Project node, unblocking ColumnPruning to narrow the physical scan automatically (MOR path). For CoW, scan narrowing is done at analysis time via buildRelationWithAttrs() since GroupBasedRowLevelOperationScanPlanning reads DataSourceV2Relation.output before ColumnPruning fires.

All three methods have default implementations that preserve today's full-row behavior. No existing connector is affected.
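As a rough illustration of the opt-in shape described above, the sketch below uses simplified stand-in interfaces rather than the real Spark classes. The three method names come from this description; everything else is an assumption made for illustration: the `Sketch*` interface names, the `PkAwareOperation` class, the use of `String[]` in place of `NamedReference[]`, and the `pk` key column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for RowLevelOperationInfo (the real API uses NamedReference[]).
interface SketchRowLevelOperationInfo {
    // Default: no information about which columns are assigned.
    default String[] updatedColumns() {
        return new String[0];
    }
}

// Hypothetical stand-in for RowLevelOperation.
interface SketchRowLevelOperation {
    // Default: the connector keeps receiving the full table row.
    default boolean supportsColumnUpdates() {
        return false;
    }

    // Default: empty, meaning no explicit write schema is declared.
    default String[] requiredDataAttributes() {
        return new String[0];
    }
}

// Hypothetical connector operation that opts in and always asks for a key column "pk".
final class PkAwareOperation implements SketchRowLevelOperation {
    private final SketchRowLevelOperationInfo info;

    PkAwareOperation(SketchRowLevelOperationInfo info) {
        this.info = info;
    }

    @Override
    public boolean supportsColumnUpdates() {
        return true;
    }

    @Override
    public String[] requiredDataAttributes() {
        // Start from the columns Spark reports as updated, then add the key if it is missing.
        List<String> required = new ArrayList<>(Arrays.asList(info.updatedColumns()));
        if (!required.contains("pk")) {
            required.add("pk");
        }
        return required.toArray(new String[0]);
    }
}
```

A connector that overrides nothing keeps the defaults, which is how existing connectors would stay on the full-row path.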

Why are the changes needed?

Today, Spark's analyzer generates identity assignments for every column during UPDATE alignment. These are used to build a Project that references all columns, preventing the optimizer from narrowing the scan. The cost scales as O(table width) regardless of how many columns are being updated.

This is especially wasteful for columnar formats like Parquet/Iceberg and is a blocker for efficient column-level update implementations in connectors (see the Efficient Column Updates Proposal in Iceberg).
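To make the identity-assignment problem concrete, here is a small sketch that models aligned assignments as a map from column name to assigned expression text. This is not Spark's representation (Spark works on `Assignment` expressions); the `IdentityAssignmentSketch` class, the string-based comparison, and the example table `(id, name, price, ts)` are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class IdentityAssignmentSketch {
    // Returns the columns whose assigned expression is something other than the column itself.
    // An entry such as name -> "name" is an identity assignment and is skipped.
    static List<String> nonIdentityColumns(Map<String, String> alignedAssignments) {
        List<String> updated = new ArrayList<>();
        for (Map.Entry<String, String> entry : alignedAssignments.entrySet()) {
            if (!entry.getKey().equals(entry.getValue())) {
                updated.add(entry.getKey());
            }
        }
        return updated;
    }
}
```

For the aligned form of `UPDATE t SET price = price * 2` on that four-column table, the map holds four entries but only `price` is non-identity, so a projection built from all four entries references every column while one built from the filtered list references a single column.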

Does this PR introduce any user-facing change?

Yes. Three new default methods are added to the public DSv2 connector API:

  • RowLevelOperation.supportsColumnUpdates()
  • RowLevelOperation.requiredDataAttributes()
  • RowLevelOperationInfo.updatedColumns()

All are opt-in with backward-compatible defaults. Existing connectors see no change.

How was this patch tested?

  • 31 new tests in DeltaBasedColumnUpdateTableSuite covering scan narrowing, write-schema narrowing, data correctness, identity assignment filtering, updatedColumns behavior, and requiredDataAttributes across MOR (delta), CoW (ReplaceData), and delete-then-reinsert paths.
  • 6 new updatedColumns tests in DeltaBasedUpdateTableSuiteBase.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

I used Claude Code to generate code and tests and manually reviewed the generated code.

@anuragmantri force-pushed the dsv2-required-data-attrs branch from fb14c34 to ae635f4 on April 23, 2026 20:51
Comment on lines +75 to +78
val required =
AttributeSet(dataAttrs) ++ AttributeSet(Seq(cond)) ++ AttributeSet(rowIdAttrs)
val narrowOutput = relation.output.filter(required.contains)
relation.copy(table = table, output = dedupAttrs(narrowOutput ++ rowIdAttrs ++ metadataAttrs))
Contributor

Can an attribute in required be missing from relation.output?
rowIdAttrs seems to be added twice.
If we already have dedupAttrs(), then it probably doesn't make sense to build AttributeSets.

Author

> Can an attribute in required be missing from relation.output?

No. dataAttrs come from the connector's requiredDataAttributes(), which are resolved against relation (via V2ExpressionUtils.resolveRefs), so they're guaranteed to be present. The condition's referenced columns are also table columns from the user's WHERE clause. rowIdAttrs and metadataAttrs can be absent from relation.output (they're resolved separately), but they're not part of the filter. They're appended unconditionally afterward via dedupAttrs(narrowOutput ++ rowIdAttrs ++ metadataAttrs).

> rowIdAttrs seems to be added 2 times. If we already have dedupAttrs() then probably doesn't make sense to build AttributeSets.

Agreed. I fixed it.
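The narrowing discussed in this thread can be sketched roughly as follows, with columns modeled as plain strings. The exact shape of the fix is not shown in the thread, so this is only one plausible reading: filter the relation output by the required set, then append row ID and metadata columns and rely on order-preserving deduplication. The `NarrowOutputSketch` class and the example column names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NarrowOutputSketch {
    // Keeps only the required columns from the relation output (in relation order),
    // then appends row ID and metadata columns, dropping any repeats.
    static List<String> narrow(
            List<String> relationOutput,
            Set<String> required,
            List<String> rowIdAttrs,
            List<String> metadataAttrs) {
        // LinkedHashSet keeps first-seen order, which plays the role of dedupAttrs here.
        LinkedHashSet<String> result = new LinkedHashSet<>();
        for (String column : relationOutput) {
            if (required.contains(column)) {
                result.add(column);
            }
        }
        result.addAll(rowIdAttrs);
        result.addAll(metadataAttrs);
        return new ArrayList<>(result);
    }
}
```

With deduplication applied at the end, a row ID column that also appears in the required set is emitted only once, which is why building separate sets for it adds nothing.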

Member

@dongjoon-hyun left a comment

Could you resolve the conflicts, @anuragmantri ?

@anuragmantri force-pushed the dsv2-required-data-attrs branch from ae635f4 to a99bb2d on May 5, 2026 23:32
@anuragmantri
Author

> Could you resolve the conflicts, @anuragmantri ?

Thanks. I rebased and fixed the conflicts.

return new NamedReference[0];
}


Member

nit. Remove redundant empty line.

Author

Done

* including the columns being updated. If {@link #requiredDataAttributes()} returns an empty
* array, Spark sends only the non-identity assigned columns (heuristic path).
*
* @since 4.2.0
Member

4.2.0 -> 4.3.0

Author

Done

* <p>
* When empty (the default), Spark falls back to sending only the non-identity assigned columns.
*
* @since 4.2.0
Member

ditto. 4.3.0

val table = buildOperationTable(tbl, UPDATE, CaseInsensitiveStringMap.empty())
val updatedCols = assignments.collect {
case Assignment(key: AttributeReference, value)
if !isIdentityAssignment(key, value) =>
Member

One liner doesn't violate the line-length rule, does it?

- case Assignment(key: AttributeReference, value)
-                 if !isIdentityAssignment(key, value) =>
+ case Assignment(key: AttributeReference, value) if !isIdentityAssignment(key, value) =>

Author

Done.

//
// When dataAttrs is non-empty, the relation output is narrowed to include only columns
// required for a column-update write. When dataAttrs is empty, the full relation.output is
// preserved.
Member

For function description, please follow the community style like the other code path.

/**
 * ...
 */

Author

Done.

// When the connector supports column updates and declares required data attributes,
// the read relation is narrowed at analysis time so that
// GroupBasedRowLevelOperationScanPlanning uses only the needed columns for the scan.
// Otherwise the full relation output is used.
Member

For function description, please follow the community style like the other code path.

/**
 * ...
 */

Author

Done

WriteDelta(writeRelation, cond, rowDeltaPlan, relation, projections, groupFilterCond)
}

// Builds the row delta projection for the column update path.
Member

For function description, please follow the community style like the other code path.

/**
 * ...
 */

Author

Done.

dataAttrsResolved(inRowAttrs)
}

// Validates the narrow-write-schema row projection output.
Member

For function description, please follow the community style like the other code path.

/**
 * ...
 */

Author

Done.

- table.skipSchemaResolution || areCompatible(inRowAttrs, outRowAttrs)
+ table.skipSchemaResolution ||
+ areCompatible(inRowAttrs, outRowAttrs) ||
+ dataAttrsResolved(inRowAttrs)
Member

nit. Please minimize the change of existing code as much as possible like the following.

table.skipSchemaResolution || areCompatible(inRowAttrs, outRowAttrs) ||
      dataAttrsResolved(inRowAttrs)

Author

Done.

* is ignored and the full table row is sent (the default behavior).
* <p>
* When non-empty, the returned columns become the write schema in declared order.
* The connector must declare all columns it wants to receive, including the columns being
Member

This is a very strong assumption, but it seems that this PR has no protection for it. May I ask whether we have some kind of assertion or test coverage for this?

Author

Each column the connector returns passes through V2ExpressionUtils.resolveRefs, which throws AnalysisException if the column does not exist.

I added a test: test("column-update: requiredDataAttributes throws AnalysisException for invalid column")
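The fail-fast behavior described in this reply can be illustrated with a simplified stand-in. This does not reproduce V2ExpressionUtils.resolveRefs: the `ResolveRefsSketch` class is hypothetical, columns are plain strings, and `IllegalArgumentException` stands in for Spark's AnalysisException.

```java
import java.util.ArrayList;
import java.util.List;

final class ResolveRefsSketch {
    // Resolves each declared column against the relation's columns, in declared order,
    // and fails on the first name that the relation does not contain.
    static List<String> resolve(List<String> declared, List<String> relationColumns) {
        List<String> resolved = new ArrayList<>();
        for (String name : declared) {
            if (!relationColumns.contains(name)) {
                // Stand-in for the AnalysisException mentioned in the thread.
                throw new IllegalArgumentException("Cannot resolve column: " + name);
            }
            resolved.add(name);
        }
        return resolved;
    }
}
```

Because resolution happens before any plan is built from the declared columns, a connector that declares a nonexistent column is rejected rather than producing a silently narrower write schema.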

//
// ColumnPruning observes exactly these references and narrows the physical scan accordingly.
// Connectors that need additional columns in the scan (e.g., partition columns for
// distribution) should declare them in requiredDataAttributes().
Member

IIUC, for the correctness, we need to throw AnalysisException if requiredDataAttributes is invalid.

Author

Each column the connector returns passes through V2ExpressionUtils.resolveRefs, which throws AnalysisException if the column does not exist.

I added a test: test("column-update: requiredDataAttributes throws AnalysisException for invalid column")

// Connectors that need additional columns in the scan (e.g., partition columns for
// distribution) should declare them in requiredDataAttributes().
//
// Note: AlignUpdateAssignments guarantees all assignment keys are top-level
Member

Do we have test coverage for this AlignUpdateAssignments contract?

Author

I added a new test, test("column-update: nested struct field update narrows to the root struct column"), that updates an inner field in a struct. AlignUpdateAssignments returns only the root key.

* whether pk is already in the updated columns list and, if not, add it to
* requiredDataAttributes().
*
* @since 4.2.0
Member

4.3.0

Author

Done

// build a plan to replace read groups in the table
val writeRelation = relation.copy(table = operationTable)
val projections = buildReplaceDataProjections(query, relation.output, metadataAttrs)
val query = updatedAndRemainingRowsPlan
Member

This looks like duplication. Let's use one variable instead of mixing two, updatedAndRemainingRowsPlan and query.

Author

Done, used a single variable

// GroupBasedRowLevelOperationScanPlanning needs explicit column declarations to narrow.
val rowAttrs: Seq[Attribute] = if (isNarrow) connectorDataAttrs else relation.output

(readRelation, rowAttrs)
Member

@dongjoon-hyun commented on May 6, 2026

Please return metadataAttrs too, to avoid the following recomputation on the caller side.

val metadataAttrs = resolveRequiredMetadataAttrs(relation, operationTable.operation)

Author

I changed this to return metadataAttrs too.

//
// Works for both the full-scan and narrow-scan CoW paths. In the narrow case,
// readRelation.output is already restricted by buildCoWReadSetup, so projecting
// all plan.output gives the correct narrow write schema.
Member

Use function description style.

Author

Done.

*
* @since 4.2.0
*/
default boolean supportsColumnUpdates() {
Member

Given the scope of this PR, shall we mention that DELETE and MERGE ignore this method?

*
* @since 4.2.0
*/
default NamedReference[] requiredDataAttributes() {
Member

Given the scope of this PR, shall we mention that DELETE and MERGE ignore this method?

Author

Even though the scope of this PR is UPDATE only, we'd like this API to work for MERGE as well (DELETE doesn't benefit since it doesn't write data columns). I'm still assessing what it takes and will add a section in the SPIP on how it could be implemented.

Happy to add a "currently only consulted for UPDATE" note in the Javadoc for now and remove it when MERGE support lands.

@dongjoon-hyun
Member

I finished the first round review, @anuragmantri .

Author

@anuragmantri left a comment

Thanks for the review, @dongjoon-hyun. I addressed your comments and cleaned up some redundant AI-generated comments.

Comment on lines -146 to 148
.getOrElse {
throw new AnalysisException(
errorClass = "_LEGACY_ERROR_TEMP_3075",
messageParameters = Map(
"tableAttr" -> tableAttr.toString,
"scanAttrs" -> scanAttrs.mkString(",")))
}
}
Author

I believe this is safe because condition-referenced columns are guaranteed to be in the scan. Please correct me if I'm wrong.

Member

@dongjoon-hyun left a comment

Thank you for updating, @anuragmantri .

BTW, I cannot find the vote for the mentioned SPIP. Did it pass the vote officially, @anuragmantri ? For a SPIP, we need an official vote result before moving forward, including merging anything, don't we? (cc @huaxingao as the Shepherd of the SPARK-56599 JIRA issue)

> What changes were proposed in this pull request?
>
> For SPIP: SPARK-56599


cc @aokolnychyi too, because RowLevelOperation.java has never been changed since it was added 4 years ago via the following.

@anuragmantri
Author

Thanks for the review @dongjoon-hyun. For the SPIP, we are waiting for a few more maintainers to also review the design as well as the PR before going for a vote.

@dongjoon-hyun dismissed their stale review on May 8, 2026 13:43

Addressed.

@anuragmantri force-pushed the dsv2-required-data-attrs branch from e806004 to 4060cbf on May 29, 2026 06:44