[SPARK-56599][SQL] Add scan narrowing for column-level UPDATEs in DSv2 by anuragmantri · Pull Request #55518 · apache/spark

anuragmantri · 2026-04-23T18:30:03Z

What changes were proposed in this pull request?

This PR adds three new default methods to the DSv2 connector API to enable scan and write-schema narrowing for column-level UPDATEs:

updatedColumns() on RowLevelOperationInfo: Spark informs the connector which columns are being assigned (non-identity only) before the operation is built.
requiredDataAttributes() on RowLevelOperation: the connector declares the exact set of data columns it needs in the write schema, symmetric with requiredMetadataAttributes().
supportsColumnUpdates() on RowLevelOperation: explicit opt-in for receiving a partial row instead of the full table row.

When a connector opts in, Spark removes identity assignments from the write plan's Project node, which lets ColumnPruning narrow the physical scan automatically on the merge-on-read (MOR) path. For copy-on-write (CoW), scan narrowing is done at analysis time via buildRelationWithAttrs(), because GroupBasedRowLevelOperationScanPlanning reads DataSourceV2Relation.output before ColumnPruning fires.

All three methods have default implementations that preserve today's full-row behavior. No existing connector is affected.
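
To make the opt-in rules above concrete, here is a minimal, self-contained sketch of how the write schema could be chosen. It is not the PR's code: SketchOperation and WriteSchemaSketch are hypothetical stand-ins, plain strings replace NamedReference, and the updated-column list that Spark would pass through RowLevelOperationInfo is handed in as a parameter.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the proposed RowLevelOperation additions.
// Plain strings replace NamedReference so the sketch compiles without Spark.
interface SketchOperation {
  // Opt-in flag; the default keeps today's full-row behavior.
  default boolean supportsColumnUpdates() {
    return false;
  }

  // Columns the connector wants in the write schema; empty means "not declared".
  default List<String> requiredDataAttributes() {
    return Collections.emptyList();
  }
}

final class WriteSchemaSketch {
  // Picks the data columns sent to the writer for an UPDATE.
  static List<String> writeSchema(
      List<String> tableColumns, List<String> updatedColumns, SketchOperation op) {
    if (!op.supportsColumnUpdates()) {
      return tableColumns; // no opt-in: full table row
    }
    List<String> required = op.requiredDataAttributes();
    if (required.isEmpty()) {
      return updatedColumns; // heuristic path: non-identity assigned columns only
    }
    return required; // explicit path: declared columns in declared order
  }

  public static void main(String[] args) {
    List<String> table = Arrays.asList("pk", "name", "age", "salary");
    List<String> updated = Arrays.asList("salary");

    SketchOperation legacy = new SketchOperation() {};
    SketchOperation heuristic = new SketchOperation() {
      @Override public boolean supportsColumnUpdates() { return true; }
    };
    SketchOperation explicit = new SketchOperation() {
      @Override public boolean supportsColumnUpdates() { return true; }
      @Override public List<String> requiredDataAttributes() {
        return Arrays.asList("pk", "salary");
      }
    };

    System.out.println(writeSchema(table, updated, legacy));
    System.out.println(writeSchema(table, updated, heuristic));
    System.out.println(writeSchema(table, updated, explicit));
  }
}
```

In this sketch a connector that overrides nothing keeps receiving all four columns, which is the backward-compatible case described above.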

Why are the changes needed?

Today, Spark's analyzer generates identity assignments for every column during UPDATE alignment. These are used to build a Project that references all columns, which prevents the optimizer from narrowing the scan. The cost scales as O(table width) regardless of how many columns are being updated.

This is especially wasteful for columnar formats like Parquet/Iceberg and is a blocker for efficient column-level update implementations in connectors (see the Efficient Column Updates Proposal in Iceberg).
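
As a rough illustration of the identity-assignment problem, the sketch below models an aligned UPDATE as a column-to-expression map and keeps only the columns whose assignment changes the value. UpdatedColumnsSketch and its string comparison are assumptions made for illustration; Spark compares resolved expressions, not strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class UpdatedColumnsSketch {
  // An aligned assignment is an identity when the column is assigned to itself.
  // Assumption: expressions are plain strings here, unlike in Spark.
  static boolean isIdentity(String column, String valueExpr) {
    return column.equals(valueExpr);
  }

  // Keeps only the columns whose assignment is not an identity.
  static List<String> updatedColumns(Map<String, String> alignedAssignments) {
    List<String> updated = new ArrayList<>();
    for (Map.Entry<String, String> e : alignedAssignments.entrySet()) {
      if (!isIdentity(e.getKey(), e.getValue())) {
        updated.add(e.getKey());
      }
    }
    return updated;
  }

  public static void main(String[] args) {
    // UPDATE t SET salary = salary * 1.1 on a four-column table, after alignment.
    Map<String, String> aligned = new LinkedHashMap<>();
    aligned.put("pk", "pk");
    aligned.put("name", "name");
    aligned.put("age", "age");
    aligned.put("salary", "salary * 1.1");
    System.out.println(updatedColumns(aligned));
  }
}
```

Only one of the four aligned assignments survives the filter, yet a Project built from the unfiltered map would still reference every column.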

Does this PR introduce any user-facing change?

Yes. Three new default methods are added to the public DSv2 connector API:

RowLevelOperation.supportsColumnUpdates()
RowLevelOperation.requiredDataAttributes()
RowLevelOperationInfo.updatedColumns()

All are opt-in with backward-compatible defaults. Existing connectors see no change.

How was this patch tested?

31 new tests in DeltaBasedColumnUpdateTableSuite covering scan narrowing, write-schema narrowing, data correctness, identity assignment filtering, updatedColumns behavior, and requiredDataAttributes across MOR (delta), CoW (ReplaceData), and delete-then-reinsert paths.
6 new updatedColumns tests in DeltaBasedUpdateTableSuiteBase.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

I used Claude Code to generate code and tests and manually reviewed the generated code.

peter-toth · 2026-04-27T16:26:08Z

+      val required =
+        AttributeSet(dataAttrs) ++ AttributeSet(Seq(cond)) ++ AttributeSet(rowIdAttrs)
+      val narrowOutput = relation.output.filter(required.contains)
+      relation.copy(table = table, output = dedupAttrs(narrowOutput ++ rowIdAttrs ++ metadataAttrs))


Can an attribute in required be missing from relation.output?
rowIdAttrs seems to be added 2 times.
If we already have a dedupAttrs(), then it probably doesn't make sense to build AttributeSets.

Can an attribute in required be missing from relation.output?

No. dataAttrs come from the connector's requiredDataAttributes(), which are resolved against relation (via V2ExpressionUtils.resolveRefs), so they're guaranteed to be present. The condition's referenced columns are also table columns from the user's WHERE clause. rowIdAttrs and metadataAttrs can be absent from relation.output (they're resolved separately), but they're not part of the filter. They're appended unconditionally afterward via dedupAttrs(narrowOutput ++ rowIdAttrs ++ metadataAttrs).

rowIdAttrs seems to be added 2 times. If we already have dedupAttrs() then probably doesn't make sense to build AttributeSets.

Agreed. I fixed it.
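
The narrowing discussed in this thread can be sketched as follows. This is only an assumption about the shape of the logic after the fix, not the PR's code: NarrowRelationSketch is a hypothetical helper, attributes are plain strings, and a LinkedHashSet plays the role of dedupAttrs().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class NarrowRelationSketch {
  // Filters the relation output down to the required data columns, then
  // appends row ID and metadata attributes without introducing duplicates.
  static List<String> narrowOutput(
      List<String> relationOutput,
      List<String> dataAttrs,
      List<String> conditionRefs,
      List<String> rowIdAttrs,
      List<String> metadataAttrs) {
    Set<String> required = new LinkedHashSet<>(dataAttrs);
    required.addAll(conditionRefs);

    // LinkedHashSet drops duplicates while keeping first-seen order.
    Set<String> result = new LinkedHashSet<>();
    for (String attr : relationOutput) {
      if (required.contains(attr)) {
        result.add(attr);
      }
    }
    // Appended unconditionally: these may be absent from relationOutput.
    result.addAll(rowIdAttrs);
    result.addAll(metadataAttrs);
    return new ArrayList<>(result);
  }

  public static void main(String[] args) {
    // Hypothetical table and attribute names, chosen only for the example.
    List<String> output = Arrays.asList("pk", "name", "age", "salary");
    System.out.println(narrowOutput(
        output,
        Arrays.asList("salary"), // data columns the connector asked for
        Arrays.asList("age"),    // columns referenced by the WHERE condition
        Arrays.asList("pk"),     // row ID attributes
        Arrays.asList("_file"))); // metadata attributes
  }
}
```

Because deduplication happens once at the end, the intermediate set only needs to answer membership checks, which is the point of the review comment about redundant AttributeSets.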

dongjoon-hyun

Could you resolve the conflicts, @anuragmantri ?

anuragmantri · 2026-05-05T23:38:47Z

Could you resolve the conflicts, @anuragmantri ?

Thanks. I rebased and fixed the conflicts.

dongjoon-hyun · 2026-05-06T00:34:42Z

    return new NamedReference[0];
  }
+
+


nit. Remove redundant empty line.

dongjoon-hyun · 2026-05-06T00:35:34Z

+   * including the columns being updated. If {@link #requiredDataAttributes()} returns an empty
+   * array, Spark sends only the non-identity assigned columns (heuristic path).
+   *
+   * @since 4.2.0


4.2.0 -> 4.3.0

dongjoon-hyun · 2026-05-06T00:35:41Z

+   * <p>
+   * When empty (the default), Spark falls back to sending only the non-identity assigned columns.
+   *
+   * @since 4.2.0


ditto. 4.3.0

dongjoon-hyun · 2026-05-06T00:40:28Z

-          val table = buildOperationTable(tbl, UPDATE, CaseInsensitiveStringMap.empty())
+          val updatedCols = assignments.collect {
+            case Assignment(key: AttributeReference, value)
+                if !isIdentityAssignment(key, value) =>


One liner doesn't violate the line-length rule, does it?

-            case Assignment(key: AttributeReference, value)
-                if !isIdentityAssignment(key, value) =>
+            case Assignment(key: AttributeReference, value) if !isIdentityAssignment(key, value) =>

dongjoon-hyun · 2026-05-06T00:44:55Z

+  //
+  // When dataAttrs is non-empty, the relation output is narrowed to include only columns
+  // required for a column-update write. When dataAttrs is empty, the full relation.output is
+  // preserved.


For function description, please follow the community style like the other code path.

/**
 * ...
 */

dongjoon-hyun · 2026-05-06T00:45:06Z

+  // When the connector supports column updates and declares required data attributes,
+  // the read relation is narrowed at analysis time so that
+  // GroupBasedRowLevelOperationScanPlanning uses only the needed columns for the scan.
+  // Otherwise the full relation output is used.


For function description, please follow the community style like the other code path.

/**
 * ...
 */

dongjoon-hyun · 2026-05-06T00:45:15Z

    WriteDelta(writeRelation, cond, rowDeltaPlan, relation, projections, groupFilterCond)
  }

+  // Builds the row delta projection for the column update path.


For function description, please follow the community style like the other code path.

/**
 * ...
 */

dongjoon-hyun · 2026-05-06T00:45:28Z

+      dataAttrsResolved(inRowAttrs)
+  }
+
+  // Validates the narrow-write-schema row projection output.


For function description, please follow the community style like the other code path.

/**
 * ...
 */

dongjoon-hyun · 2026-05-06T00:46:36Z

-    table.skipSchemaResolution || areCompatible(inRowAttrs, outRowAttrs)
+    table.skipSchemaResolution ||
+      areCompatible(inRowAttrs, outRowAttrs) ||
+      dataAttrsResolved(inRowAttrs)


nit. Please minimize the change of existing code as much as possible like the following.

table.skipSchemaResolution || areCompatible(inRowAttrs, outRowAttrs) || dataAttrsResolved(inRowAttrs)

dongjoon-hyun · 2026-05-06T00:51:39Z

+   * is ignored and the full table row is sent (the default behavior).
+   * <p>
+   * When non-empty, the returned columns become the write schema in declared order.
+   * The connector must declare all columns it wants to receive, including the columns being


This is a very strong assumption, but this PR doesn't seem to have any protection for it. May I ask if we have some kind of assertion or test coverage for this?

Each column the connector returns passes through V2ExpressionUtils.resolveRefs, which throws AnalysisException if the column does not exist.

I added a test test("column-update: requiredDataAttributes throws AnalysisException for invalid column")
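
The protection described in this reply amounts to resolving every declared column against the table and failing fast on a miss. The sketch below shows that idea only; ResolveRefsSketch is hypothetical, it does not call V2ExpressionUtils.resolveRefs, IllegalArgumentException stands in for AnalysisException, and name matching is case-sensitive for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ResolveRefsSketch {
  // Resolves each requested column against the table, preserving declared order.
  // Throws as soon as a requested column is not part of the table.
  static List<String> resolve(List<String> tableColumns, List<String> requested) {
    List<String> resolved = new ArrayList<>();
    for (String name : requested) {
      if (!tableColumns.contains(name)) {
        throw new IllegalArgumentException("Cannot resolve column: " + name);
      }
      resolved.add(name);
    }
    return resolved;
  }

  public static void main(String[] args) {
    List<String> table = Arrays.asList("pk", "name", "salary");
    System.out.println(resolve(table, Arrays.asList("pk", "salary")));
    try {
      resolve(table, Arrays.asList("pk", "bonus")); // "bonus" is not a table column
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```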

dongjoon-hyun · 2026-05-06T00:56:47Z

+  //
+  // ColumnPruning observes exactly these references and narrows the physical scan accordingly.
+  // Connectors that need additional columns in the scan (e.g., partition columns for
+  // distribution) should declare them in requiredDataAttributes().


IIUC, for correctness, we need to throw AnalysisException if requiredDataAttributes is invalid.

Each column the connector returns passes through V2ExpressionUtils.resolveRefs, which throws AnalysisException if the column does not exist.

I added a test test("column-update: requiredDataAttributes throws AnalysisException for invalid column")

dongjoon-hyun · 2026-05-06T00:58:39Z

+  // Connectors that need additional columns in the scan (e.g., partition columns for
+  // distribution) should declare them in requiredDataAttributes().
+  //
+  // Note: AlignUpdateAssignments guarantees all assignment keys are top-level


Do we have test coverage for this AlignUpdateAssignments contract?

I added a new test test("column-update: nested struct field update narrows to the root struct column") that updates an inner field of a struct; AlignUpdateAssignments returns only the root key.
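
The contract exercised by that test, where a nested field update is reported as its root column, can be illustrated with a small sketch. RootColumnSketch is hypothetical, and splitting on the first dot is a simplifying assumption that ignores quoted column names containing dots.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class RootColumnSketch {
  // Returns the top-level column for a dotted field path, e.g. "address.city".
  // Assumption: names do not themselves contain dots.
  static String rootColumn(String fieldPath) {
    int dot = fieldPath.indexOf('.');
    return dot < 0 ? fieldPath : fieldPath.substring(0, dot);
  }

  // Maps updated field paths to distinct root columns, keeping first-seen order.
  static List<String> rootColumns(List<String> fieldPaths) {
    Set<String> roots = new LinkedHashSet<>();
    for (String path : fieldPaths) {
      roots.add(rootColumn(path));
    }
    return new ArrayList<>(roots);
  }

  public static void main(String[] args) {
    // UPDATE t SET address.city = ..., address.zip = ..., name = ...
    System.out.println(rootColumns(Arrays.asList("address.city", "address.zip", "name")));
  }
}
```

Two updates inside the same struct collapse to a single root column, so the narrowed scan still reads the whole struct.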

dongjoon-hyun · 2026-05-06T00:59:09Z

+   * whether pk is already in the updated columns list and, if not, add it to
+   * requiredDataAttributes().
+   *
+   * @since 4.2.0


dongjoon-hyun · 2026-05-06T01:01:27Z

-    // build a plan to replace read groups in the table
    val writeRelation = relation.copy(table = operationTable)
-    val projections = buildReplaceDataProjections(query, relation.output, metadataAttrs)
+    val query = updatedAndRemainingRowsPlan


This looks like duplication. Let's use one variable instead of mixing two, updatedAndRemainingRowsPlan and query.

Done, used a single variable

dongjoon-hyun · 2026-05-06T01:03:37Z

+    //   GroupBasedRowLevelOperationScanPlanning needs explicit column declarations to narrow.
+    val rowAttrs: Seq[Attribute] = if (isNarrow) connectorDataAttrs else relation.output
+
+    (readRelation, rowAttrs)


Please return metadataAttrs too, to avoid the following recomputation on the caller side.

val metadataAttrs = resolveRequiredMetadataAttrs(relation, operationTable.operation)

I changed this to return metadataAttrs too.

dongjoon-hyun · 2026-05-06T01:04:29Z

+  //
+  // Works for both the full-scan and narrow-scan CoW paths.  In the narrow case,
+  // readRelation.output is already restricted by buildCoWReadSetup, so projecting
+  // all plan.output gives the correct narrow write schema.


Use function description style.

dongjoon-hyun · 2026-05-06T01:07:04Z

+   *
+   * @since 4.2.0
+   */
+  default boolean supportsColumnUpdates() {


Given the scope of this PR, shall we mention that DELETE and MERGE ignore this method?

dongjoon-hyun · 2026-05-06T01:07:18Z

+   *
+   * @since 4.2.0
+   */
+  default NamedReference[] requiredDataAttributes() {


Given the scope of this PR, shall we mention that DELETE and MERGE ignore this method?

Even though the scope of this PR is UPDATE only, we'd like this API to work for MERGE as well (DELETE doesn't benefit since it doesn't write data columns). I'm still assessing what it takes and will add a section in the SPIP on how it could be implemented.

Happy to add a "currently only consulted for UPDATE" note in the Javadoc for now and remove it when MERGE support lands.

dongjoon-hyun · 2026-05-06T01:08:26Z

I finished the first round review, @anuragmantri .

anuragmantri

Thanks for the review @dongjoon-hyun. I addressed your comments and cleaned up some redundant AI-generated comments.

anuragmantri · 2026-05-06T17:53:45Z

    return new NamedReference[0];
  }
+
+


anuragmantri · 2026-05-06T17:53:50Z

+   * including the columns being updated. If {@link #requiredDataAttributes()} returns an empty
+   * array, Spark sends only the non-identity assigned columns (heuristic path).
+   *
+   * @since 4.2.0


anuragmantri · 2026-05-06T22:50:19Z

+   * is ignored and the full table row is sent (the default behavior).
+   * <p>
+   * When non-empty, the returned columns become the write schema in declared order.
+   * The connector must declare all columns it wants to receive, including the columns being


Each column the connector returns passes through V2ExpressionUtils.resolveRefs, which throws AnalysisException if the column does not exist.

I added a test test("column-update: requiredDataAttributes throws AnalysisException for invalid column")

anuragmantri · 2026-05-06T22:51:55Z

+   * whether pk is already in the updated columns list and, if not, add it to
+   * requiredDataAttributes().
+   *
+   * @since 4.2.0


anuragmantri · 2026-05-06T22:52:05Z

+  //
+  // When dataAttrs is non-empty, the relation output is narrowed to include only columns
+  // required for a column-update write. When dataAttrs is empty, the full relation.output is
+  // preserved.


anuragmantri · 2026-05-06T23:48:12Z

+  // Connectors that need additional columns in the scan (e.g., partition columns for
+  // distribution) should declare them in requiredDataAttributes().
+  //
+  // Note: AlignUpdateAssignments guarantees all assignment keys are top-level


I added a new test test("column-update: nested struct field update narrows to the root struct column") that updates an inner field of a struct; AlignUpdateAssignments returns only the root key.

anuragmantri · 2026-05-06T23:50:05Z

+  //
+  // ColumnPruning observes exactly these references and narrows the physical scan accordingly.
+  // Connectors that need additional columns in the scan (e.g., partition columns for
+  // distribution) should declare them in requiredDataAttributes().


Each column the connector returns passes through V2ExpressionUtils.resolveRefs, which throws AnalysisException if the column does not exist.

I added a test test("column-update: requiredDataAttributes throws AnalysisException for invalid column")

anuragmantri · 2026-05-07T00:29:51Z

+      dataAttrsResolved(inRowAttrs)
+  }
+
+  // Validates the narrow-write-schema row projection output.


anuragmantri · 2026-05-07T00:30:56Z

-    table.skipSchemaResolution || areCompatible(inRowAttrs, outRowAttrs)
+    table.skipSchemaResolution ||
+      areCompatible(inRowAttrs, outRowAttrs) ||
+      dataAttrsResolved(inRowAttrs)


anuragmantri · 2026-05-07T01:00:12Z

+   *
+   * @since 4.2.0
+   */
+  default NamedReference[] requiredDataAttributes() {


Even though the scope of this PR is UPDATE only, we'd like this API to work for MERGE as well (DELETE doesn't benefit since it doesn't write data columns). I'm still assessing what it takes and will add a section in the SPIP on how it could be implemented.

Happy to add a "currently only consulted for UPDATE" note in the Javadoc for now and remove it when MERGE support lands.

anuragmantri · 2026-05-07T01:15:18Z

-        .getOrElse {
-          throw new AnalysisException(
-            errorClass = "_LEGACY_ERROR_TEMP_3075",
-            messageParameters = Map(
-              "tableAttr" -> tableAttr.toString,
-              "scanAttrs" -> scanAttrs.mkString(",")))
-        }
    }


I believe this is safe because condition-referenced columns are guaranteed to be in the scan. Please correct me if I'm wrong.

dongjoon-hyun

Thank you for updating, @anuragmantri .

BTW, I cannot find the vote for the mentioned SPIP. Did it pass the vote officially, @anuragmantri? For a SPIP, we need an official vote result to move forward, including merging anything, don't we? (cc @huaxingao as the Shepherd of the SPARK-56599 JIRA issue)

What changes were proposed in this pull request?

For SPIP: SPARK-56599

cc @aokolnychyi too, because RowLevelOperation.java has never been changed since it was added 4 years ago via the following.

#36304

anuragmantri · 2026-05-08T13:29:15Z

Thanks for the review @dongjoon-hyun. For the SPIP, we are waiting for a few more maintainers to also review the design as well as the PR before going for a vote.

Addressed.

anuragmantri force-pushed the dsv2-required-data-attrs branch from fb14c34 to ae635f4 Compare April 23, 2026 20:51

peter-toth reviewed Apr 27, 2026

dongjoon-hyun reviewed May 5, 2026

anuragmantri force-pushed the dsv2-required-data-attrs branch from ae635f4 to a99bb2d Compare May 5, 2026 23:32

dongjoon-hyun reviewed May 6, 2026

dongjoon-hyun previously requested changes May 6, 2026

dongjoon-hyun reviewed May 6, 2026

anuragmantri commented May 7, 2026

dongjoon-hyun reviewed May 8, 2026

[SPARK-56599][SQL] Add scan narrowing for column-level UPDATEs in DSv2

4df6279

anuragmantri added 4 commits May 28, 2026 23:25

Address review comments and rebase

d4babf7

Remove narrowing of CAST()

f33797f

Address review comments and clean up

e3c253e

Scala fmt

4060cbf

anuragmantri force-pushed the dsv2-required-data-attrs branch from e806004 to 4060cbf Compare May 29, 2026 06:44
