[SPARK-34794][SQL] Fix nested transform issue#31887
Conversation
|
Maybe you can make a test with an example of the shadowing to demonstrate how it works correctly now. |
|
I think your example from slack would be prefect as a test. Especially because of the divergent behavior between Spark SQL and the programmatic api |
|
@nvander1 test added |
RussellSpitzer
left a comment
There was a problem hiding this comment.
small nit, looks good to me other than that
|
pinging @maropu, @HyukjinKwon |
|
ok to test |
|
|
| assert(ex3.getMessage.contains("cannot resolve 'a'")) | ||
| } | ||
|
|
||
| test("nested transform (DSL)") { |
There was a problem hiding this comment.
nit: nested transform (DSL) -> SPARK-34794: nested transform
| } | ||
|
|
||
| // counter to ensure lambda variable names are unique | ||
| private val lambdaVarNameCounter = new AtomicInteger(0) |
There was a problem hiding this comment.
How about defining a companion object for UnresolvedNamedLambdaVariable like this?
object UnresolvedNamedLambdaVariable {
// counter to ensure lambda variable names are unique
private val lambdaVarNameCounter = new AtomicInteger(0)
def apply(args: Seq[String]): UnresolvedNamedLambdaVariable = {
// add a counter number in the suffix
}
}
| flatten( | ||
| transform( | ||
| $"numbers", | ||
| (number: Column) => transform( |
There was a problem hiding this comment.
I think we need exhaustive tests for this issue, e.g., two/three argument cases and their combination cases.
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #136924 has finished for PR 31887 at commit
|
|
Kindly ping. |
|
@maropu I won't be able to work on this until June 1, other duties are getting in the way. |
|
cc @ueshin too FYI |
Ah, I see. Is it okay that I take this over? |
Go ahead. |
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of #31887. Closes #31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes #32424 from maropu/pr31887. Lead-authored-by: dsolow <dsolow@sayari.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: dmsolow <dsolow@sayarianalytics.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of #31887. Closes #31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes #32424 from maropu/pr31887. Lead-authored-by: dsolow <dsolow@sayari.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: dmsolow <dsolow@sayarianalytics.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of apache#31887. Closes apache#31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes apache#32424 from maropu/pr31887. Lead-authored-by: dsolow <dsolow@sayari.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: dmsolow <dsolow@sayarianalytics.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit 6df4ec0) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
…e functions ### What changes were proposed in this pull request? To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions. This is the rework of apache#31887. Closes apache#31887. ### Why are the changes needed? This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable) For this query: ``` val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ``` This is the current (incorrect) output: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| +------------------------------------------------------------------------+ ``` And this is the correct output after fix: ``` +------------------------------------------------------------------------+ |zipped | +------------------------------------------------------------------------+ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| +------------------------------------------------------------------------+ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added the new test in `DataFrameFunctionsSuite`. Closes apache#32424 from maropu/pr31887. Lead-authored-by: dsolow <dsolow@sayari.com> Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org> Co-authored-by: dmsolow <dsolow@sayarianalytics.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit f550e03) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
What changes were proposed in this pull request?
Increment a global counter and use the current value to name LambdaVariables created by higher order functions.
Why are the changes needed?
This moves away from the current hard-coded variable names which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable)
For this query:
This is the current (incorrect) output:
And this is the correct output after fix:
Does this PR introduce any user-facing change?
No
How was this patch tested?
I'm not sure how to test this because the current tests don't run the
functions.transformdirectly@nvander1 @RussellSpitzer