Skip to content

[SPARK-56395][SQL][FOLLOWUP] Fix RewriteNearestByJoin nullability for LEFT OUTER#56484

Closed
cloud-fan wants to merge 3 commits into
apache:masterfrom
cloud-fan:SPARK-56395-followup
Closed

[SPARK-56395][SQL][FOLLOWUP] Fix RewriteNearestByJoin nullability for LEFT OUTER#56484
cloud-fan wants to merge 3 commits into
apache:masterfrom
cloud-fan:SPARK-56395-followup

Conversation

@cloud-fan

@cloud-fan cloud-fan commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Followup to #55682.

In RewriteNearestByJoin, when the NEAREST BY join type is LEFT OUTER, the synthesized Join widens the right-side columns to nullable. However, the synthesized Aggregate (and the optional __ranking__ Project) built on top of that join still referenced the right-side columns via right.output and rankingExpression with their original (non-nullable) nullability. As a result the rewritten plan can declare a right-side column as non-nullable while its child -- the join -- produces it as nullable.

This PR maps the right-side attributes to their widened (nullable) form for LEFT OUTER and rewrites both the CreateStruct(right.*) and the ranking expression to use that widened nullability, so the rewritten plan's schema is consistent with its child. For INNER joins the right side is not widened, so this is a no-op.

Why are the changes needed?

Without this fix the rewritten plan for a LEFT OUTER NEAREST BY declares right-side columns non-nullable while its join child produces them nullable -- an inconsistency that nullability/plan-integrity validation flags as a regression.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a regression test to RewriteNearestByJoinSuite that, for both INNER and LEFT OUTER, asserts every right-side attribute the synthesized Aggregate references agrees on nullability with its join child. The test uses non-nullable right-side columns so that LEFT OUTER's widening is observable -- it fails without this fix (x#.. declared nullable=false but its child produces nullable=true) and passes with it, while INNER stays a no-op. The suite's expected-plan helper was also updated to mirror the widened nullability.

Was this patch authored or co-authored using generative AI tooling?

No.

@yadavay-amzn yadavay-amzn left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one optional follow-up

fix is minimal, idiomatic, and well-tested. No blocking issues.

One non-blocking observation:

The new regression test uses a deterministic ranking. The fix also covers the non-deterministic ranking branch (which routes through Alias(rankingInJoin, "__ranking__")), but no test fails if a future change accidentally regresses that branch. A non-deterministic-ranking variant with non-nullable right inputs would round out coverage. Optional.

What I verified

  • Premise is real: Join.computeOutput widens right-side outputs for LeftOuter. The pre-fix code captures unwidened right.output references in places that sit on top of the widened join, creating a real internal inconsistency.
  • Bug repro and fix confirmed locally: restored the production file to pre-fix while keeping the new regression test, ran it, and got the exact error message documented in the PR body (LeftOuter: x#2 declared nullable=false but its child produces nullable=true).
  • Why existing tests missed the bug: the DSL $"x".int constructs AttributeReference with nullable=true by
    default, making withNullability(true) a no-op. The new test uses non-nullable inputs explicitly, which is what
    the PR body calls out and is the cleanest possible fixture.

… LEFT OUTER

Followup to apache#55682.

In RewriteNearestByJoin, when the NEAREST BY join type is LEFT OUTER, the
synthesized Join widens the right-side columns to nullable. However, the
synthesized Aggregate (and the optional __ranking__ Project) built on top of
that join still referenced the right-side columns via right.output and
rankingExpression with their original (non-nullable) nullability. As a result
the rewritten plan can declare a right-side column as non-nullable while its
child -- the join -- produces it as nullable.

This commit maps the right-side attributes to their widened (nullable) form
for LEFT OUTER and rewrites both the CreateStruct(right.*) and the ranking
expression to use that widened nullability, so the rewritten plan's schema is
consistent with its child. For INNER joins the right side is not widened, so
this is a no-op.
…bility

Update RewriteNearestByJoinSuite's expected-plan helper to mirror the widened
right-side nullability for LEFT OUTER, and add a focused regression test that
asserts every right-side attribute the synthesized Aggregate references agrees
on nullability with its join child. The test uses non-nullable right columns so
the LEFT OUTER widening is observable -- it fails without the production fix and
passes with it. INNER stays a no-op.
… nondeterministic ranking path

Generalize the LEFT OUTER nullability regression test in RewriteNearestByJoinSuite
from an Aggregate-only assertion to a whole-plan integrity walk, and add a
nondeterministic-ranking case alongside the deterministic one across both Inner
and LeftOuter.

The widened ranking reference lands in the Aggregate on the deterministic path but
in a `__ranking__` Project (above the Join) on the nondeterministic path. The old
assertion only inspected Aggregate expressions, so a regression isolated to the
nondeterministic branch would not have been caught. No framework check guards this
class of bug either -- LogicalPlanIntegrity compares types `asNullable` and schemas
via `equalsIgnoreNullability` -- so this test is the only guard.

Verified the new coverage has teeth: reverting just the nondeterministic alias to
the un-widened expression fails the test specifically on LeftOuter/nondeterministic
while the deterministic case still passes.

Co-authored-by: Isaac
@cloud-fan cloud-fan force-pushed the SPARK-56395-followup branch from 589beef to ce977d2 Compare June 14, 2026 14:30
@cloud-fan

Copy link
Copy Markdown
Contributor Author

thanks for review, merging to master/4.x/4.2 (not a user-facing bug, won't fail RC)

@cloud-fan cloud-fan closed this in 62e4e16 Jun 15, 2026
cloud-fan added a commit that referenced this pull request Jun 15, 2026
… LEFT OUTER

### What changes were proposed in this pull request?

Followup to #55682.

In `RewriteNearestByJoin`, when the `NEAREST BY` join type is `LEFT OUTER`, the synthesized `Join` widens the right-side columns to nullable. However, the synthesized `Aggregate` (and the optional `__ranking__` `Project`) built on top of that join still referenced the right-side columns via `right.output` and `rankingExpression` with their original (non-nullable) nullability. As a result the rewritten plan can declare a right-side column as non-nullable while its child -- the join -- produces it as nullable.

This PR maps the right-side attributes to their widened (nullable) form for `LEFT OUTER` and rewrites both the `CreateStruct(right.*)` and the ranking expression to use that widened nullability, so the rewritten plan's schema is consistent with its child. For `INNER` joins the right side is not widened, so this is a no-op.

### Why are the changes needed?

Without this fix the rewritten plan for a `LEFT OUTER NEAREST BY` declares right-side columns non-nullable while its join child produces them nullable -- an inconsistency that nullability/plan-integrity validation flags as a regression.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a regression test to `RewriteNearestByJoinSuite` that, for both `INNER` and `LEFT OUTER`, asserts every right-side attribute the synthesized `Aggregate` references agrees on nullability with its join child. The test uses **non-nullable** right-side columns so that `LEFT OUTER`'s widening is observable -- it fails without this fix (`x#.. declared nullable=false but its child produces nullable=true`) and passes with it, while `INNER` stays a no-op. The suite's expected-plan helper was also updated to mirror the widened nullability.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #56484 from cloud-fan/SPARK-56395-followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 62e4e16)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Jun 15, 2026
… LEFT OUTER

### What changes were proposed in this pull request?

Followup to #55682.

In `RewriteNearestByJoin`, when the `NEAREST BY` join type is `LEFT OUTER`, the synthesized `Join` widens the right-side columns to nullable. However, the synthesized `Aggregate` (and the optional `__ranking__` `Project`) built on top of that join still referenced the right-side columns via `right.output` and `rankingExpression` with their original (non-nullable) nullability. As a result the rewritten plan can declare a right-side column as non-nullable while its child -- the join -- produces it as nullable.

This PR maps the right-side attributes to their widened (nullable) form for `LEFT OUTER` and rewrites both the `CreateStruct(right.*)` and the ranking expression to use that widened nullability, so the rewritten plan's schema is consistent with its child. For `INNER` joins the right side is not widened, so this is a no-op.

### Why are the changes needed?

Without this fix the rewritten plan for a `LEFT OUTER NEAREST BY` declares right-side columns non-nullable while its join child produces them nullable -- an inconsistency that nullability/plan-integrity validation flags as a regression.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a regression test to `RewriteNearestByJoinSuite` that, for both `INNER` and `LEFT OUTER`, asserts every right-side attribute the synthesized `Aggregate` references agrees on nullability with its join child. The test uses **non-nullable** right-side columns so that `LEFT OUTER`'s widening is observable -- it fails without this fix (`x#.. declared nullable=false but its child produces nullable=true`) and passes with it, while `INNER` stays a no-op. The suite's expected-plan helper was also updated to mirror the widened nullability.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #56484 from cloud-fan/SPARK-56395-followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 62e4e16)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants