move make_array array_append array_prepend array_concat function to datafusion-functions-array crate #9343
guojidan wants to merge 13 commits into apache:main
Conversation
    /// [`ScalarUDFImpl`]: datafusion_expr::ScalarUDFImpl
    macro_rules! make_udf_function {
-       ($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => {
+       ($UDF:ty, $EXPR_FN:ident, $arg:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => {
Why is this changed to vec? The previous syntax is able to parse arbitrary args.
For functions with an indefinite number of args, like ArrayConcat or MakeArray, the previous syntax cannot handle them well.
And we don't care about arg names; ScalarFunction::args just needs a Vec, so I think changing to vec is reasonable.
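For concreteness, this is roughly what the two generated helpers look like from the caller's side (a sketch only; the expr_fn module path is an assumption about the new crate's exports):

use datafusion_expr::{col, lit, Expr};
// assumed export location of the macro-generated helpers
use datafusion_functions_array::expr_fn::{array_append, make_array};

fn main() {
    // fixed-arity helper: each argument is a separate parameter
    let appended: Expr = array_append(col("a"), lit(1));
    // variadic helper: callers pass a Vec<Expr>, which is why the macro arm
    // takes a single Vec argument instead of $($arg:ident)*
    let arr: Expr = make_array(vec![lit(1), lit(2), lit(3)]);
    println!("{appended}, {arr}");
}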
I see.
Can we have another macro handle functions with arbitrary args, like what we had before?
macro_rules! scalar_expr {
($ENUM:ident, $FUNC:ident, $($arg:ident)*, $DOC:expr) => {
#[doc = $DOC]
pub fn $FUNC($($arg: Expr),*) -> Expr {
Expr::ScalarFunction(ScalarFunction::new(
built_in_function::BuiltinScalarFunction::$ENUM,
vec![$($arg),*],
))
}
};
}
macro_rules! nary_scalar_expr {
($ENUM:ident, $FUNC:ident, $DOC:expr) => {
#[doc = $DOC ]
pub fn $FUNC(args: Vec<Expr>) -> Expr {
Expr::ScalarFunction(ScalarFunction::new(
built_in_function::BuiltinScalarFunction::$ENUM,
args,
))
}
};
}
I think we can avoid wrapping a single element into vec![], which is quite nice.
How about this? I tested it and it works well:
macro_rules! make_udf_function {
($UDF:ty, $EXPR_FN:ident, $($arg:ident)*, $DOC:expr , $SCALAR_UDF_FN:ident) => {
paste::paste! {
// "fluent expr_fn" style function
#[doc = $DOC]
pub fn $EXPR_FN($($arg: Expr),*) -> Expr {
Expr::ScalarFunction(ScalarFunction::new_udf(
$SCALAR_UDF_FN(),
vec![$($arg),*],
))
}
/// Singleton instance of [`$UDF`], ensures the UDF is only created once
/// named STATIC_$(UDF). For example `STATIC_ArrayToString`
#[allow(non_upper_case_globals)]
static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
std::sync::OnceLock::new();
/// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
///
/// [`ScalarUDF`]: datafusion_expr::ScalarUDF
pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
[< STATIC_ $UDF >]
.get_or_init(|| {
std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
<$UDF>::new(),
))
})
.clone()
}
}
};
($UDF:ty, $EXPR_FN:ident, $DOC:expr , $SCALAR_UDF_FN:ident) => {
paste::paste! {
// "fluent expr_fn" style function
#[doc = $DOC]
pub fn $EXPR_FN(arg: Vec<Expr>) -> Expr {
Expr::ScalarFunction(ScalarFunction::new_udf(
$SCALAR_UDF_FN(),
arg,
))
}
/// Singleton instance of [`$UDF`], ensures the UDF is only created once
/// named STATIC_$(UDF). For example `STATIC_ArrayToString`
#[allow(non_upper_case_globals)]
static [< STATIC_ $UDF >]: std::sync::OnceLock<std::sync::Arc<datafusion_expr::ScalarUDF>> =
std::sync::OnceLock::new();
/// ScalarFunction that returns a [`ScalarUDF`] for [`$UDF`]
///
/// [`ScalarUDF`]: datafusion_expr::ScalarUDF
pub fn $SCALAR_UDF_FN() -> std::sync::Arc<datafusion_expr::ScalarUDF> {
[< STATIC_ $UDF >]
.get_or_init(|| {
std::sync::Arc::new(datafusion_expr::ScalarUDF::new_from_impl(
<$UDF>::new(),
))
})
.clone()
}
}
};
}
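For reference, a sketch of how the two arms might be invoked (the UDF type names, doc strings, and generated function names here are illustrative, not necessarily what the crate ends up using):

// fixed-arity arm: the argument idents become the parameters of
// array_append(array: Expr, element: Expr)
make_udf_function!(
    ArrayAppend,
    array_append,
    array element,
    "appends an element to the end of an array.",
    array_append_udf
);

// Vec arm: no argument list, so the generated array_concat takes a Vec<Expr>
make_udf_function!(
    ArrayConcat,
    array_concat,
    "concatenates arrays.",
    array_concat_udf
);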
    }

    fn invoke(&self, args: &[ColumnarValue]) -> datafusion_common::Result<ColumnarValue> {
        make_scalar_function_with_hints(crate::kernels::array_append)(args)
I think we don't need this. I played around with it and found that the error is in LargeList casting, which panics.
If the list casting is fixed, we probably don't need make_scalar_function_with_hints anymore.
Yes, without the make_scalar_function_with_hints function, arrow-datafusion/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs will panic, because arrow-rs does not support casting LargeList. So at present we still need make_scalar_function_with_hints.
We can use as_list::<i64> for LargeList and as_list::<i32> for List.
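For illustration, a minimal sketch of that suggestion: branch on the list type and use as_list::<i32> / as_list::<i64> instead of casting LargeList to List (the function name and body are made up; only the dispatch pattern matters):

use arrow_array::{cast::AsArray, Array, ArrayRef};
use arrow_schema::DataType;
use datafusion_common::{internal_err, Result};

// Count the elements of each list row without casting between List and LargeList.
fn list_element_counts(arr: &ArrayRef) -> Result<Vec<usize>> {
    match arr.data_type() {
        DataType::List(_) => {
            let list = arr.as_list::<i32>();
            Ok((0..list.len()).map(|i| list.value(i).len()).collect())
        }
        DataType::LargeList(_) => {
            let list = arr.as_list::<i64>();
            Ok((0..list.len()).map(|i| list.value(i).len()).collect())
        }
        other => internal_err!("expected a list type, got {other:?}"),
    }
}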
But I think the make_scalar_function_with_hints function has clear logic: it determines whether the output is a Scalar or an Array. I want to keep it 🤔
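For context, the Scalar-vs-Array bookkeeping being described is roughly the following (a simplified sketch of DataFusion's make_scalar_function pattern, not the actual make_scalar_function_with_hints implementation; the hint handling is omitted, and it assumes the current ScalarValue::to_array signature that returns a Result):

use arrow_array::ArrayRef;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::ColumnarValue;

// Wrap an array kernel so it accepts ColumnarValue args and returns a Scalar
// again when every input was a Scalar.
fn scalar_aware(
    inner: impl Fn(&[ArrayRef]) -> Result<ArrayRef>,
) -> impl Fn(&[ColumnarValue]) -> Result<ColumnarValue> {
    move |args: &[ColumnarValue]| {
        // remember whether every input was a scalar
        let all_scalar = args.iter().all(|a| matches!(a, ColumnarValue::Scalar(_)));
        // promote scalars to one-row arrays so the kernel only sees arrays
        let arrays: Vec<ArrayRef> = args
            .iter()
            .map(|a| match a {
                ColumnarValue::Array(a) => Ok(a.clone()),
                ColumnarValue::Scalar(s) => s.to_array(),
            })
            .collect::<Result<_>>()?;
        let result = inner(&arrays)?;
        if all_scalar {
            // all inputs were scalars, so hand back a scalar result
            Ok(ColumnarValue::Scalar(ScalarValue::try_from_array(&result, 0)?))
        } else {
            Ok(ColumnarValue::Array(result))
        }
    }
}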
I'm not sure if there is any case where we need a Scalar here. If so, we can definitely keep make_scalar_function_with_hints; otherwise, we can add it back when we need it.
@jayzhan211 thank you very much for your review 😄
alamb left a comment:
Thank you @guojidan and @jayzhan211 for the review. I am only concerned about the newly introduced dependency between datafusion-optimizer and datafusion-functions-array... I wonder if we can find a way to avoid that dependency (maybe we could move the array-specific rewrites into their own pass somehow 🤔)
    encode(col("a").cast_to(&DataType::Utf8, &schema)?, lit("hex")),
    decode(lit("1234"), lit("hex")),
-   array_to_string(array(vec![lit(1), lit(2), lit(3)]), lit(",")),
+   array_to_string(make_array(vec![lit(1), lit(2), lit(3)]), lit(",")),
datafusion/optimizer/Cargo.toml (Outdated)
    chrono = { workspace = true }
    datafusion-common = { workspace = true }
    datafusion-expr = { workspace = true }
+   datafusion-functions-array = { workspace = true }
I am not sure about this new dependency -- it means that using the optimizer will require bringing in the physical exprs, etc., and users (like dask-sql) that only want the planner code would bring in substantial unused code.
I realize the reason this is required is to preserve the semantics of the existing rewrite pass in datafusion/optimizer/src/analyzer/rewrite_expr.rs. I wonder if we can somehow avoid adding this new dependency 🤔
Maybe we can split the UDFs into logical expr and physical expr parts and export only the logical expr part to the optimizer conditionally (with #[cfg(feature = "array_expressions")]), so we can still optimize UDFs but only import what is necessary for the user.
Is it possible to avoid importing functions-array if we get the udf via self.context_provider.get_function_meta(&name)?
Is it possible to avoid importing functions-array if we get the udf via self.context_provider.get_function_meta(&name)?

Wow, this may be feasible.
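In the spirit of that suggestion, a rough sketch of what the rewrite could do: resolve the UDF by name through the ContextProvider rather than depending on datafusion-functions-array at compile time (the function name and error message are made up; only get_function_meta is the real API being leaned on):

use datafusion_common::{plan_err, Result};
use datafusion_expr::expr::ScalarFunction;
use datafusion_expr::Expr;
use datafusion_sql::planner::ContextProvider;

// Rewrite a named array function into a UDF call looked up from the registry.
fn rewrite_to_udf<P: ContextProvider>(
    provider: &P,
    name: &str,
    args: Vec<Expr>,
) -> Result<Expr> {
    match provider.get_function_meta(name) {
        Some(udf) => Ok(Expr::ScalarFunction(ScalarFunction::new_udf(udf, args))),
        None => plan_err!("array function '{name}' is not registered"),
    }
}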
hi @jayzhan211 , do you think my current approach to the analyzer is feasible? I moved the array analyzer to the datafusion-functions-array crate and add it to the analyzer rules when the array_expressions feature is enabled.
Probably not a good idea, it seems like duplicated code without any good reason.
@guojidan
Let's see if it is possible to optionally import datafusion-functions-array; it seems an easier way than bringing the UDFs into the optimizer.
hi @jayzhan211 , if
I think it depends on whether we should place
yes, I think
#9100 (comment)
as you said, we can use
I converted it to a draft; feel free to ping me when the PR is ready for review.
Remember to replace them in the rewriter too.
hi @jayzhan211 , sorry for taking so long to reply. This PR's TODO list is:
Nope
difficult to implement,
Because this PR has been open for too long, it is difficult to rebase or merge, so I opened a new PR #9504 and will close this one.
Which issue does this PR close?
Closes #9322.
Rationale for this change
What changes are included in this PR?
Move the make_array, array_append, array_prepend, and array_concat functions to the datafusion-functions-array crate.
Are these changes tested?
yes
Are there any user-facing changes?
no