Update Arrow/Parquet to 51.0.0, tonic to 0.11 #9613
Conversation
```rust
use futures::StreamExt;

/// This example shows that it is possible to convert query results into Rust structs.
/// It will collect the query results into RecordBatch, then convert it to serde_json::Value.
```
apache/arrow-rs#5318 deprecated the serde_json based APIs
```rust
"plain" => Ok(parquet::basic::Encoding::PLAIN),
"plain_dictionary" => Ok(parquet::basic::Encoding::PLAIN_DICTIONARY),
"rle" => Ok(parquet::basic::Encoding::RLE),
#[allow(deprecated)]
```
I don't understand the reference (to the JSON writer) when this is for parquet encoding. Is there some other encoding/compression scheme that was deprecated too?
This is a copypasta meant to link to apache/arrow-rs#5348
```rust
"epoch" => extract_date_part!(&array, epoch),
_ => exec_err!("Date part '{date_part}' not supported"),
}?;
let arr = match part.to_lowercase().as_str() {
```
As above, I don't understand the reference to the JSON writer PR
The changes in this module look more like switching to use the date_part kernels that @Jefffrey added in apache/arrow-rs#5319 and a cleanup of the code to use the unary kernel more effectively (the changes look good to me)
Yes this was a copypasta
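The unary-kernel cleanup discussed above can be illustrated with a minimal, self-contained sketch. Note this is not the arrow-rs API: plain slices with `Option` stand in for Arrow arrays and their null handling, and the `unary` function here is a hypothetical model of the real kernel.

```rust
// Minimal stand-in for an element-wise "unary" kernel: apply `f` to each
// valid element, preserving nulls (modeled here as Option).
fn unary<T, U>(values: &[Option<T>], f: impl Fn(&T) -> U) -> Vec<Option<U>> {
    values.iter().map(|v| v.as_ref().map(&f)).collect()
}

fn main() {
    // e.g. widening extracted second values to f64 before scaling
    let secs = vec![Some(5_i32), None, Some(59)];
    let as_f64 = unary(&secs, |s| *s as f64);
    assert_eq!(as_f64, vec![Some(5.0), None, Some(59.0)]);
}
```

The appeal of the pattern is that null propagation lives in one place instead of being re-implemented in every date-part branch.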
alamb left a comment
Thank you for the contribution @tustvold -- I think this looks good overall (though I didn't understand several ticket references)
I also took the liberty of merging this PR up from main and running `cargo update` in `datafusion-cli` to get a clean CI run
```rust
let int_col = b.column(0).as_primitive::<Int32Type>();
let float_col = b.column(1).as_primitive::<Float64Type>();

// converts it to a serde_json type and then converts that into a Rust type
```
I do think showing how to use serde to convert Arrow --> Rust structs is important. While I am well aware its performance is not good, the serde concept is widely understood and supported in the Rust ecosystem.
Is there any API that can do serde into Rust structs in the core arrow crates anymore?
If not, perhaps we can point in comments at a crate like https://github.com/chmp/serde_arrow (or bring an example that parses the JSON back to Json::Value and then serde's)
We/I can do this as a follow on PR
You can serialize to JSON and parse it, but I would rather encourage people towards the performant way of doing things
> Is there any API that can do serde into Rust structs in the core arrow crates anymore?
I'd dispute that we ever really had a way to do this, going via serde_json::Value is more of a hack than anything else. Serializing to a JSON string and back will likely be faster
> I'd dispute that we ever really had a way to do this, going via serde_json::Value is more of a hack than anything else. Serializing to a JSON string and back will likely be faster
The key thing in my mind is to make it easy / quick for new users to get something working quickly. I am well aware that custom array -> struct will be the fastest performance, but I think it takes non trivial expertise in manipulating the arrow-rs API (especially when it comes to StructArray and ListArray) -- so offering them a fast way to get started with a slower API is important I think
I think since this is an example, we can always update / improve it as a follow on PR
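The "performant way" mentioned in this thread is a direct column-to-struct conversion with no `serde_json::Value` intermediate. A toy sketch of that idea, with no arrow-rs dependency: `Batch` and `Row` here are hypothetical stand-ins for a typed `RecordBatch` and the user's target struct, not real API types.

```rust
// Hypothetical stand-in for a RecordBatch: two already-typed columns.
struct Batch {
    int_col: Vec<i32>,
    float_col: Vec<f64>,
}

// The Rust struct a user wants each row converted into.
#[derive(Debug, PartialEq)]
struct Row {
    id: i32,
    value: f64,
}

// Direct columnar -> row conversion: zip the typed columns row by row,
// skipping any JSON intermediate entirely.
fn batch_to_rows(b: &Batch) -> Vec<Row> {
    b.int_col
        .iter()
        .zip(&b.float_col)
        .map(|(&id, &value)| Row { id, value })
        .collect()
}

fn main() {
    let b = Batch {
        int_col: vec![1, 2],
        float_col: vec![1.5, 2.5],
    };
    let rows = batch_to_rows(&b);
    assert_eq!(rows, vec![Row { id: 1, value: 1.5 }, Row { id: 2, value: 2.5 }]);
}
```

With real Arrow arrays the zip would operate on the downcast columns (e.g. the `as_primitive::<Int32Type>()` accessors shown in the diff above); handling `StructArray`/`ListArray` is where the extra expertise alamb mentions comes in.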
```rust
#[rustfmt::skip]
let expected = [
let expected = [
```
This whitespace change seems unnecessary
```rust
| DataType::Utf8View
| DataType::BinaryView
| DataType::ListView(_)
| DataType::LargeListView(_) => {
```
```rust
use object_store::{ObjectMeta, ObjectStore};

/// Size of the buffer for [`AsyncArrowWriter`].
const PARQUET_WRITER_BUFFER_SIZE: usize = 10485760;
```
Let's leave this one open for another day or two so there is at least one work day for people to comment.

Apologies, it was late and it would appear rather than linking to the appropriate tickets I just repeatedly linked to the same one.
datafusion-examples/Cargo.toml
Outdated
```toml
@@ -75,6 +75,6 @@ serde_json = { workspace = true }
tempfile = { workspace = true }
tokio = { workspace = true, features = ["rt-multi-thread", "parking_lot"] }
# 0.10 and 0.11 are incompatible. Need to upgrade tonic to 0.11 when upgrading to arrow 51
```
```toml
# 0.10 and 0.11 are incompatible. Need to upgrade tonic to 0.11 when upgrading to arrow 51
```
No longer need this comment
```rust
fn seconds(array: &dyn Array, unit: TimeUnit) -> Result<ArrayRef> {
    let sf = match unit {
        Second => 1_f64,
        Millisecond => 1_000_f64,
        Microsecond => 1_000_000_f64,
        Nanosecond => 1_000_000_000_f64,
    };
    let secs = date_part(array, DatePart::Second)?;
    let secs = as_int32_array(secs.as_ref())?;
```
Is it worth making a note somewhere here that array must be a PrimitiveArray? Otherwise the as_int32_array() call can panic if it is a dictionary, and we may want to make this clear to anyone viewing the code hoping to make it work for dictionaries in the future.
(Previous code encoded this by having array be a &PrimitiveArray<T>)
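The concern above boils down to downcasting a dynamically typed array whose runtime type is not what the code assumes. A self-contained model of that failure mode, using `std::any` rather than arrow's actual array types (the `Array` trait, `Int32Array`, and `DictionaryArray` below are toy stand-ins, not the real arrow-rs definitions):

```rust
use std::any::Any;

// Toy stand-ins for arrow's dynamically typed arrays.
trait Array {
    fn as_any(&self) -> &dyn Any;
}

struct Int32Array(Vec<i32>);
struct DictionaryArray; // keys/values elided for brevity

impl Array for Int32Array {
    fn as_any(&self) -> &dyn Any { self }
}
impl Array for DictionaryArray {
    fn as_any(&self) -> &dyn Any { self }
}

// Like as_int32_array: succeeds only when the runtime type really is Int32.
// A dictionary-encoded input is rejected instead of silently working.
fn as_int32(a: &dyn Array) -> Result<&Int32Array, String> {
    a.as_any()
        .downcast_ref::<Int32Array>()
        .ok_or_else(|| "expected Int32Array, found another array type".to_string())
}

fn main() {
    assert!(as_int32(&Int32Array(vec![1, 2, 3])).is_ok());
    assert!(as_int32(&DictionaryArray).is_err());
}
```

The old signature (`&PrimitiveArray<T>`) encoded this invariant in the type system; with `&dyn Array` it becomes a runtime check, which is the documentation gap the reviewer is pointing at.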
alamb left a comment
I merged up from main to resolve a conflict and addressed comments. I plan to merge this in later today
🚀
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?