Update Arrow/Parquet to 51.0.0, tonic to 0.11 #9613
Conversation
```rust
use futures::StreamExt;

/// This example shows that it is possible to convert query results into Rust structs.
/// It will collect the query results into RecordBatch, then convert it to serde_json::Value.
```
apache/arrow-rs#5318 deprecated the serde_json based APIs
```rust
"plain" => Ok(parquet::basic::Encoding::PLAIN),
"plain_dictionary" => Ok(parquet::basic::Encoding::PLAIN_DICTIONARY),
"rle" => Ok(parquet::basic::Encoding::RLE),
#[allow(deprecated)]
```
I don't understand the reference (to the JSON writer) when this is for parquet encoding. Is there some other encoding/compression scheme that was deprecated too?
This is a copypasta meant to link to apache/arrow-rs#5348
```rust
"epoch" => extract_date_part!(&array, epoch),
_ => exec_err!("Date part '{date_part}' not supported"),
}?;
let arr = match part.to_lowercase().as_str() {
```
As above, I don't understand the reference to the JSON writer PR
The changes in this module look more like switching to use the date_part kernels that @Jefffrey added in apache/arrow-rs#5319 and a cleanup of the code to use the unary kernel more effectively (the changes look good to me)
Yes this was a copypasta
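The unary-kernel cleanup discussed above can be illustrated with a minimal, self-contained sketch. Note this is not the arrow-rs API: plain slices with `Option` stand in for Arrow arrays and their null handling, and the `unary` function here is a hypothetical model of the real kernel.

```rust
// Minimal stand-in for an element-wise "unary" kernel: apply `f` to each
// valid element, preserving nulls (modeled here as Option).
fn unary<T, U>(values: &[Option<T>], f: impl Fn(&T) -> U) -> Vec<Option<U>> {
    values.iter().map(|v| v.as_ref().map(&f)).collect()
}

fn main() {
    // e.g. widening extracted second values to f64 before scaling
    let secs = vec![Some(5_i32), None, Some(59)];
    let as_f64 = unary(&secs, |s| *s as f64);
    assert_eq!(as_f64, vec![Some(5.0), None, Some(59.0)]);
}
```

The appeal of the pattern is that null propagation lives in one place instead of being re-implemented in every date-part branch.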
alamb left a comment
Thank you for the contribution @tustvold -- I think this looks good overall (though I didn't understand several ticket references)
I also took the liberty of merging this PR up from main and running `cargo update` in `datafusion-cli` to get a clean CI run
```rust
let int_col = b.column(0).as_primitive::<Int32Type>();
let float_col = b.column(1).as_primitive::<Float64Type>();

// converts it to a serde_json type and then converts that into a Rust type
```
I do think showing how to use serde to convert Arrow --> Rust structs is important. While I am well aware its performance is not good, the serde concept is widely understood and supported in the Rust ecosystem.
Is there any API that can do serde into Rust structs in the core arrow crates anymore?
If not, perhaps we can point in comments at a crate like https://github.com/chmp/serde_arrow (or bring an example that parses the JSON back to Json::Value and then serde's)
We/I can do this as a follow on PR
You can serialize to JSON and parse it, but I would rather encourage people towards the performant way of doing things
> Is there any API that can do serde into Rust structs in the core arrow crates anymore?
I'd dispute that we ever really had a way to do this, going via serde_json::Value is more of a hack than anything else. Serializing to a JSON string and back will likely be faster
> I'd dispute that we ever really had a way to do this, going via serde_json::Value is more of a hack than anything else. Serializing to a JSON string and back will likely be faster
The key thing in my mind is to make it easy / quick for new users to get something working quickly. I am well aware that custom array -> struct will be the fastest performance, but I think it takes non trivial expertise in manipulating the arrow-rs API (especially when it comes to StructArray and ListArray) -- so offering them a fast way to get started with a slower API is important I think
I think since this is an example, we can always update / improve it as a follow on PR
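The "performant way" mentioned in this thread is a direct column-to-struct conversion with no `serde_json::Value` intermediate. A toy sketch of that idea, with no arrow-rs dependency: `Batch` and `Row` here are hypothetical stand-ins for a typed `RecordBatch` and the user's target struct, not real API types.

```rust
// Hypothetical stand-in for a RecordBatch: two already-typed columns.
struct Batch {
    int_col: Vec<i32>,
    float_col: Vec<f64>,
}

// The Rust struct a user wants each row converted into.
#[derive(Debug, PartialEq)]
struct Row {
    id: i32,
    value: f64,
}

// Direct columnar -> row conversion: zip the typed columns row by row,
// skipping any JSON intermediate entirely.
fn batch_to_rows(b: &Batch) -> Vec<Row> {
    b.int_col
        .iter()
        .zip(&b.float_col)
        .map(|(&id, &value)| Row { id, value })
        .collect()
}

fn main() {
    let b = Batch {
        int_col: vec![1, 2],
        float_col: vec![1.5, 2.5],
    };
    let rows = batch_to_rows(&b);
    assert_eq!(rows, vec![Row { id: 1, value: 1.5 }, Row { id: 2, value: 2.5 }]);
}
```

With real Arrow arrays the zip would operate on the downcast columns (e.g. the `as_primitive::<Int32Type>()` accessors shown in the diff above); handling `StructArray`/`ListArray` is where the extra expertise alamb mentions comes in.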
```rust
#[rustfmt::skip]
let expected = [
let expected = [
```
This whitespace change seems unnecessary
```rust
| DataType::Utf8View
| DataType::BinaryView
| DataType::ListView(_)
| DataType::LargeListView(_) => {
```
```rust
use object_store::{ObjectMeta, ObjectStore};

/// Size of the buffer for [`AsyncArrowWriter`].
const PARQUET_WRITER_BUFFER_SIZE: usize = 10485760;
```
Let's leave this one open for another day or two so there is at least one work day for people to comment.

Apologies, it was late and it would appear rather than linking to the appropriate tickets I just repeatedly linked to the same one.
datafusion-examples/Cargo.toml
Outdated
```toml
@@ -75,6 +75,6 @@ serde_json = { workspace = true }
tempfile = { workspace = true }
tokio = { workspace = true, features = ["rt-multi-thread", "parking_lot"] }
# 0.10 and 0.11 are incompatible. Need to upgrade tonic to 0.11 when upgrading to arrow 51
```
```toml
# 0.10 and 0.11 are incompatible. Need to upgrade tonic to 0.11 when upgrading to arrow 51
```
No longer need this comment
```rust
fn seconds(array: &dyn Array, unit: TimeUnit) -> Result<ArrayRef> {
    let sf = match unit {
        Second => 1_f64,
        Millisecond => 1_000_f64,
        Microsecond => 1_000_000_f64,
        Nanosecond => 1_000_000_000_f64,
    };
    let secs = date_part(array, DatePart::Second)?;
    let secs = as_int32_array(secs.as_ref())?;
```
Is it worth making a note somewhere here that array must be a PrimitiveArray? Otherwise the as_int32_array() call can panic if it is a dictionary, and we may want to make this clear to anyone viewing the code hoping to make it work for dictionaries in the future.
(Previous code encoded this by having array be a &PrimitiveArray<T>)
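The concern above boils down to downcasting a dynamically typed array whose runtime type is not what the code assumes. A self-contained model of that failure mode, using `std::any` rather than arrow's actual array types (the `Array` trait, `Int32Array`, and `DictionaryArray` below are toy stand-ins, not the real arrow-rs definitions):

```rust
use std::any::Any;

// Toy stand-ins for arrow's dynamically typed arrays.
trait Array {
    fn as_any(&self) -> &dyn Any;
}

struct Int32Array(Vec<i32>);
struct DictionaryArray; // keys/values elided for brevity

impl Array for Int32Array {
    fn as_any(&self) -> &dyn Any { self }
}
impl Array for DictionaryArray {
    fn as_any(&self) -> &dyn Any { self }
}

// Like as_int32_array: succeeds only when the runtime type really is Int32.
// A dictionary-encoded input is rejected instead of silently working.
fn as_int32(a: &dyn Array) -> Result<&Int32Array, String> {
    a.as_any()
        .downcast_ref::<Int32Array>()
        .ok_or_else(|| "expected Int32Array, found another array type".to_string())
}

fn main() {
    assert!(as_int32(&Int32Array(vec![1, 2, 3])).is_ok());
    assert!(as_int32(&DictionaryArray).is_err());
}
```

The old signature (`&PrimitiveArray<T>`) encoded this invariant in the type system; with `&dyn Array` it becomes a runtime check, which is the documentation gap the reviewer is pointing at.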
alamb left a comment
I merged up from main to resolve a conflict and addressed comments. I plan to merge this in later today
🚀
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?