Convert nth_value builtIn function to User Defined Window Function #13201
alamb merged 23 commits into apache:main
|
This is so exciting. FYI @jonathanc-n and @Omega359 |
I personally think it would be fine to leave Perhaps you can leave a stub in like
enum BuiltInWindowFunction {
    // Never created, will be removed in a follow on PR
    Stub
};
Then we can focus this PR on making sure that |
Thanks @alamb will continue with what you've said |
8cf82f0 to fda6a6f
|
Wanted to update here. I think I'm almost finished but probably encountered a side effect. This query fails in slt file: I hope to fix this and make this ready tomorrow |
In the built-in (older) version the output field is defined like:
fn field(&self) -> Result<Field> {
    let nullable = true;
    Ok(Field::new(&self.name, self.data_type.clone(), nullable))
}
In the current code, the data type of the field is hard-coded as DataType::UInt64:
fn field(&self, field_args: WindowUDFFieldArgs) -> Result<Field> {
    let nullable = true;
    Ok(Field::new(field_args.name(), DataType::UInt64, nullable))
}
To fix this use |
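The difference between the two `field` implementations can be sketched without any DataFusion dependency. Everything below (`DataType`, `Field`, `FieldArgs`, `field_hard_coded`, `field_from_input`) is a hypothetical stand-in written for illustration, not the Arrow or DataFusion API, and it assumes the intended fix is to take the output type from the first input argument.

```rust
// Hypothetical stand-ins for illustration; these are NOT the Arrow/DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    UInt64,
    Int64,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// What the function is told when asked to describe its output column.
struct FieldArgs<'a> {
    name: &'a str,
    input_types: &'a [DataType],
}

/// Buggy variant: the output type ignores the inputs, so a call such as
/// nth_value(float_column, 2) is reported as UInt64.
fn field_hard_coded(args: &FieldArgs) -> Field {
    Field {
        name: args.name.to_string(),
        data_type: DataType::UInt64,
        nullable: true,
    }
}

/// Assumed fix: the output type follows the type of the first argument.
/// Returns None when no input types are available.
fn field_from_input(args: &FieldArgs) -> Option<Field> {
    let data_type = args.input_types.first()?.clone();
    Some(Field {
        name: args.name.to_string(),
        data_type,
        nullable: true,
    })
}
```

With input types `[Float64, Int64]`, the hard-coded variant reports `UInt64` while the input-driven variant reports `Float64`, which matches the kind of type mismatch described in the comment above.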
0706334 to fddbc58
Thanks, @jcsherin that was the fix. I've fixed that issue but encountered another one. I return Error from partition evaluator but I think it is not honored. But it should not succeed since: |
TL;DR
For invalid input expressions, built-in window functions fail early when converting the logical plan to a physical plan. But user-defined window functions will complete planning, and fail only during physical execution. Validation of input expressions in user-defined window functions runs only during physical execution. In this case, is it not better for a udwf to fail early when converting to a physical plan? A possible solution is to update
datafusion/datafusion/physical-plan/src/windows/mod.rs Lines 158 to 164 in b61b2fc
Edge Case: Empty Table
DataFusion CLI v42.2.0
> CREATE TABLE t1(v1 BIGINT);
0 row(s) fetched.
Elapsed 0.020 seconds.
There are currently no rows in t1
datafusion/datafusion/physical-plan/src/windows/window_agg_exec.rs Lines 319 to 321 in 89e96b4
The
> SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+-------------------------------------------------------------------------------------------------------------+
| nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
+-------------------------------------------------------------------------------------------------------------+
+-------------------------------------------------------------------------------------------------------------+
0 row(s) fetched.
Elapsed 0.018 seconds.
After we insert a few values into t1
> insert into t1 values (123), (456);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
This feature is not implemented: There is only support Literal types for field at idx: 1 in Window Function
Planning divergence between built-in & user-defined window functions
In
> EXPLAIN SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: NTH_VALUE(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
| | WindowAggr: windowExpr=[[NTH_VALUE(Float64(inf), t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS NTH_VALUE(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] |
| | TableScan: t1 projection=[v1] |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.
But this is not the case for user-defined window functions. In this branch we instead see that a complete plan is built and the failure happens only when the query executes:
> EXPLAIN SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
| | WindowAggr: windowExpr=[[nth_value(Float64(inf), t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] |
| | TableScan: t1 projection=[v1] |
| physical_plan | ProjectionExec: expr=[nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING] |
| | WindowAggExec: wdw=[nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name: "nth_value(Utf8(\"+Inf\"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false }] |
| | SortExec: expr=[v1@0 ASC NULLS LAST], preserve_partitioning=[false] |
| | MemoryExec: partitions=1, partition_sizes=[0] |
| | |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.019 seconds. |
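The empty-table behaviour can be reproduced in miniature. The sketch below is not DataFusion code: `Arg`, `Evaluator`, `create_evaluator` and `execute` are hypothetical names, and it assumes that argument validation lives in evaluator construction and that an evaluator is only built once there is a partition to process. Under those assumptions an invalid `n` goes unnoticed on empty input and surfaces only after rows exist.

```rust
// Hypothetical miniature of a window operator; not DataFusion's API.

/// An argument expression: either a literal or a column reference.
#[derive(Debug, Clone, PartialEq)]
enum Arg {
    Literal(i64),
    Column(String),
}

/// Stand-in for a partition evaluator of nth_value.
#[derive(Debug)]
struct Evaluator {
    n: i64,
}

/// Validation happens here: `n` must be a literal.
fn create_evaluator(n_arg: &Arg) -> Result<Evaluator, String> {
    match n_arg {
        Arg::Literal(n) => Ok(Evaluator { n: *n }),
        Arg::Column(name) => Err(format!("n must be a literal, got column {}", name)),
    }
}

/// Execution builds one evaluator per partition and returns the number of
/// rows processed. With no partitions the loop body never runs, so
/// `create_evaluator` is never called and no error is reported.
fn execute(n_arg: &Arg, partitions: &[Vec<i64>]) -> Result<usize, String> {
    let mut rows = 0;
    for partition in partitions {
        let evaluator = create_evaluator(n_arg)?;
        // A real evaluator would compute values; this sketch only counts rows.
        let _ = evaluator.n;
        rows += partition.len();
    }
    Ok(rows)
}
```

Calling `execute` with a column argument and no partitions returns `Ok(0)`, while the same call with two partitions returns an error, mirroring the two `SELECT` results shown in the transcript.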
|
@jcsherin thanks for the very detailed explanation. In this case, I think it would be better to update WindowUDFImpl in a followup PR for enhancement right? I can skip this test case in the scope of this PR. Correct me if I'm wrong please |
Sure, we can improve the API in another PR. Here is a workaround that fixes the failing test:
// In datafusion/physical-plan/src/windows/mod.rs
fn create_udwf_window_expr(
    fun: &Arc<WindowUDF>,
    args: &[Arc<dyn PhysicalExpr>],
    input_schema: &Schema,
    name: String,
    ignore_nulls: bool,
) -> Result<Arc<dyn BuiltInWindowFunctionExpr>> {
    // need to get the types into an owned vec for some reason
    let input_types: Vec<_> = args
        .iter()
        .map(|arg| arg.data_type(input_schema))
        .collect::<Result<_>>()?;

    let udwf_expr = Arc::new(WindowUDFExpr {
        fun: Arc::clone(fun),
        args: args.to_vec(),
        input_types,
        name,
        is_reversed: false,
        ignore_nulls,
    });

    // Early validation of input expressions
    //
    // We create a partition evaluator because in the user-defined window
    // implementation this is where the code for parsing input expressions
    // exists. The benefits are:
    // - If any of the input expressions are invalid we catch them early
    //   in the planning phase, rather than during execution.
    // - Maintains compatibility with built-in (now removed) window
    //   functions validation behavior.
    // - Predictable and reliable error handling.
    //
    // See discussion here:
    // https://github.com/apache/datafusion/pull/13201#issuecomment-2454209975
    let _ = udwf_expr.create_evaluator()?;

    Ok(udwf_expr)
}
I verified that this works in your branch.
DataFusion CLI v42.2.0
> CREATE TABLE t1(v1 BIGINT);
0 row(s) fetched.
Elapsed 0.019 seconds.
> EXPLAIN SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
| | WindowAggr: windowExpr=[[nth_value(Float64(inf), t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] |
| | TableScan: t1 projection=[v1] |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.018 seconds.
> SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
This feature is not implemented: There is only support Literal types for field at idx: 1 in Window Function
This workaround may not be ideal, but at least we do not have to skip this test. Also please feel free to update the code/comments as you see fit. |
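The pattern behind the workaround, building a throwaway evaluator at planning time purely for its validation side effect, can be shown in isolation. This is a minimal sketch under stated assumptions: `WindowExpr`, `NthValueExpr` and `plan` are hypothetical names invented here, and only the method name `create_evaluator` is borrowed from the snippet above.

```rust
// Hypothetical names throughout; a sketch of the pattern, not DataFusion's API.

/// Stand-in for a window function expression that can build evaluators.
trait WindowExpr {
    /// Parses and validates the input expressions.
    fn create_evaluator(&self) -> Result<(), String>;
}

struct NthValueExpr {
    /// Some(n) if the second argument was a literal, None otherwise.
    n_literal: Option<i64>,
}

impl WindowExpr for NthValueExpr {
    fn create_evaluator(&self) -> Result<(), String> {
        match self.n_literal {
            Some(_) => Ok(()),
            None => Err("n must be a literal".to_string()),
        }
    }
}

/// Planning step: build an evaluator once and discard it, so that invalid
/// arguments are rejected here, before any data is read.
fn plan<E: WindowExpr>(expr: E) -> Result<E, String> {
    expr.create_evaluator()?; // the value is dropped; only the error matters
    Ok(expr)
}
```

The trade-off is that an evaluator is constructed one extra time per expression during planning, which seems acceptable when construction is cheap, and which is why the comment above calls it a workaround rather than an API change.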
|
/// Create an expression to represent the `nth_value` window function
///
pub fn nth_value(arg: datafusion_expr::Expr, n: Option<i64>) -> datafusion_expr::Expr {
The type of n is i64, not Option<i64>.
See the rust docs: https://docs.rs/datafusion/latest/datafusion/logical_expr/window_function/fn.nth_value.html
Also add a roundtrip logical plan test for this API here:
Fixed and added test
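To illustrate why a plain `i64` is a natural type for `n`, here is a small standalone sketch of an nth-value lookup over one window frame. It is not the DataFusion implementation. The function below is hypothetical, and its handling of edge cases (a negative `n` counts from the end of the frame, `n == 0` is rejected, an out-of-range `n` yields no value) is an assumption made for this example.

```rust
/// Returns the n-th value of a window frame, counting from 1.
///
/// Assumptions of this sketch (not taken from the PR):
/// - a negative `n` counts from the end of the frame,
/// - `n == 0` is an error,
/// - an out-of-range `n` yields `None`.
fn nth_value<T: Clone>(frame: &[T], n: i64) -> Result<Option<T>, String> {
    if n == 0 {
        return Err("n must be non-zero".to_string());
    }
    let len = frame.len() as i64;
    // Translate the 1-based (or end-relative) position into a 0-based index.
    let idx = if n > 0 { n - 1 } else { len + n };
    if idx < 0 || idx >= len {
        return Ok(None);
    }
    Ok(Some(frame[idx as usize].clone()))
}
```

Because every call needs a concrete position, there is no meaningful "absent" case for `n` that an `Option<i64>` would express.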
|
@buraksenn Tremendous effort 🙌. These changes look good to me. |
Co-authored-by: Sherin Jacob <jacob@protoship.io>
|
@buraksenn and @berkaysynnada Thanks! @alamb This PR is ready. |
|
Awesome -- thank you so much. I will review this PR hopefully later today |
alamb left a comment
Thank you so much @buraksenn , @jcsherin -- it is just so beautiful to see this PR now after all the work. It is basically perfect from my perspective 🏆
[dev-dependencies]
criterion = { version = "0.5", features = ["async_futures"] }
datafusion-functions-aggregate = { workspace = true }
datafusion-functions-window = { workspace = true }
Some day I hope we can remove these dependencies (so we can make testing physical-plan faster), but that is not a part of this PR.
// We create a partition evaluator because in the user-defined window
// implementation this is where code for parsing input expressions
// exist. The benefits are:
// - If any of the input expressions are invalid we catch them early
💯 for these comments that explain the rationale
|
I also took the liberty of merging up from main to make sure we haven't hit any logical conflicts with this PR |
|
I don't think there is any reason to wait around for this PR -- people know it is coming, so let's get this in 🚀 |
…pache#13201)
* refactored nth_value
* continue
* test
* proto and rustlint
* fix datatype
* cont
* cont
* apply jcsherins early validation
* docs
* doc
* Apply suggestions from code review
Co-authored-by: Sherin Jacob <jacob@protoship.io>
* passes lint but does not have tests
* continue
* Update roundtrip_physical_plan.rs
* udwf, not udaf
* fix bounded but not fixed roundtrip
* added
* Update datafusion/sqllogictest/test_files/errors.slt
Co-authored-by: Sherin Jacob <jacob@protoship.io>
---------
Co-authored-by: Sherin Jacob <jacob@protoship.io>
Co-authored-by: berkaysynnada <berkay.sahin@synnada.ai>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
…pache#13201) (cherry picked from commit 54ab128)
…pache#13201) v44 (cherry picked from commit 54ab128)
Which issue does this PR close?
Closes #12649
Rationale for this change
Context: #8709
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
no