Describe the bug
Reading a Parquet file with a timestamp column that contains a far-future date such as 9999-12-31 02:00:00 results in an overflow panic with the following output:
thread 'tokio-runtime-worker' panicked at 'attempt to multiply with overflow'
To Reproduce
Steps to reproduce the behavior:
- Download the attached zip file that contains the parquet file: data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet.zip
- Unzip it; this should give you the data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet file.
- Create a new project with cargo new read-parquet, create a data folder in your project, and put the parquet file in that data folder.
- Modify the Cargo.toml file to contain the following:
[package]
name = "read-parquet"
version = "0.1.0"
edition = "2021"
[dependencies]
tokio = "1.14"
arrow = "6.0"
datafusion = "6.0"
- Put the following code in main.rs to read the given parquet file:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();

    /*
     * Parquet file schema:
     *
     * message spark_schema {
     *   optional binary licence_code (UTF8);
     *   optional binary vehicle_make (UTF8);
     *   optional binary fuel_type (UTF8);
     *   optional int96 dimension_load_date;
     * }
     */
    ctx
        .register_parquet("vehicles", "./data/data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet")
        .await?;

    // Reading the int96 column is what triggers the panic.
    let df = ctx
        .sql("
            SELECT
                licence_code,
                vehicle_make,
                fuel_type,
                CAST(dimension_load_date as TIMESTAMP) as dms
            FROM vehicles
            LIMIT 10
        ")
        .await?;

    df
        .show()
        .await?;

    Ok(())
}
- Execute cargo run.
- Result:
thread 'tokio-runtime-worker' panicked at 'attempt to multiply with overflow', /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:46
stack backtrace:
0: rust_begin_unwind
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
1: core::panicking::panic_fmt
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
2: core::panicking::panic
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:48:5
3: <parquet::arrow::converter::Int96ArrayConverter as parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert::{{closure}}::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:46
4: core::option::Option<T>::map
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/option.rs:846:29
5: <parquet::arrow::converter::Int96ArrayConverter as parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:30
6: core::iter::adapters::map::map_fold::{{closure}}
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:84:28
7: core::iter::traits::iterator::Iterator::fold
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:2171:21
8: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:124:9
9: core::iter::traits::iterator::Iterator::for_each
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:737:9
10: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/spec_extend.rs:40:17
11: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/spec_from_iter_nested.rs:56:9
12: alloc::vec::source_iter_marker::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/source_iter_marker.rs:31:20
13: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2549:9
14: core::iter::traits::iterator::Iterator::collect
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:1745:9
15: <parquet::arrow::converter::Int96ArrayConverter as parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:177:13
16: <parquet::arrow::converter::ArrayRefConverter<S,A,C> as parquet::arrow::converter::Converter<S,alloc::sync::Arc<dyn arrow::array::array::Array>>>::convert
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:450:9
17: <parquet::arrow::array_reader::ComplexObjectArrayReader<T,C> as parquet::arrow::array_reader::ArrayReader>::next_batch
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:545:25
18: <parquet::arrow::array_reader::StructArrayReader as parquet::arrow::array_reader::ArrayReader>::next_batch::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:1130:27
19: core::iter::adapters::map::map_try_fold::{{closure}}
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:91:28
20: core::iter::traits::iterator::Iterator::try_fold
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:1995:21
21: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:117:9
22: <parquet::arrow::array_reader::StructArrayReader as parquet::arrow::array_reader::ArrayReader>::next_batch
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:1127:30
23: <parquet::arrow::arrow_reader::ParquetRecordBatchReader as core::iter::traits::iterator::Iterator>::next
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/arrow_reader.rs:175:15
24: datafusion::physical_plan::file_format::parquet::read_partition
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/file_format/parquet.rs:424:19
25: <datafusion::physical_plan::file_format::parquet::ParquetExec as datafusion::physical_plan::ExecutionPlan>::execute::{{closure}}::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/file_format/parquet.rs:213:29
26: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/task.rs:42:21
27: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/core.rs:161:17
28: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/loom/std/unsafe_cell.rs:14:9
29: tokio::runtime::task::core::CoreStage<T>::poll
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/core.rs:151:13
30: tokio::runtime::task::harness::poll_future::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:461:19
31: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panic/unwind_safe.rs:271:9
32: std::panicking::try::do_call
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:406:40
33: <unknown>
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/distinct_expressions.rs:127:15
34: std::panicking::try
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:370:19
35: std::panic::catch_unwind
at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panic.rs:133:14
36: tokio::runtime::task::harness::poll_future
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:449:18
37: tokio::runtime::task::harness::Harness<T,S>::poll_inner
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:98:27
38: tokio::runtime::task::harness::Harness<T,S>::poll
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:53:15
39: tokio::runtime::task::raw::poll
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/raw.rs:113:5
40: tokio::runtime::task::raw::RawTask::poll
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/raw.rs:70:18
41: tokio::runtime::task::UnownedTask<S>::run
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/mod.rs:379:9
42: tokio::runtime::blocking::pool::Inner::run
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/pool.rs:264:17
43: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/pool.rs:244:17
Expected behavior
The parquet file should be read without a panic. The same file can be read with the parquet-tools CLI and with Apache Spark.
Additional context
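The root cause is that the parquet file contains rows with far-future values such as 9999-12-31 02:00:00 in the dimension_load_date column. These dates are supported by Parquet and Spark. As the stack trace shows, the int96 values are converted to an Arrow TimestampNanosecondType array, which stores nanoseconds since the Unix epoch in a signed 64-bit integer; that representation cannot hold dates later than 2262-04-11, so the multiplication overflows.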
The content of the parquet file is:
+------------+------------------+------------------+-------------------+
|licence_code|vehicle_make |fuel_type |dimension_load_date|
+------------+------------------+------------------+-------------------+
|odc-odbl |**Not Provided** |**Not Provided** |9999-12-31 02:00:00|
|odc-odbl |**Not Applicable**|**Not Applicable**|9998-12-31 02:00:00|
|odc-odbl |SAVIEM |Petrol |2021-06-09 03:02:37|
|odc-odbl |YAMAHA |Petrol |2021-06-09 03:43:47|
|odc-odbl |VAUXHALL |Petrol |2020-10-18 03:23:47|
|odc-odbl |VAUXHALL |Petrol |2021-06-09 03:02:37|
|odc-odbl |BMW |Petrol |2021-06-09 03:38:39|
|odc-odbl |MG |Petrol |2020-10-18 03:23:47|
|odc-odbl |PEUGEOT |Diesel |2020-10-18 03:35:16|
|odc-odbl |FORD |Diesel |2020-10-18 03:23:47|
|odc-odbl |FORD |Petrol |2020-10-18 03:12:55|
|odc-odbl |SKODA |Diesel |2021-06-09 03:02:37|
|odc-odbl |SHOGUN |Diesel |2020-10-18 03:12:55|
|odc-odbl |MITSUBISHI |Diesel |2021-06-10 01:15:47|
+------------+------------------+------------------+-------------------+
To find out more about how the root cause was detected you can follow apache/datafusion#1359.
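To illustrate why the multiplication overflows, here is a minimal sketch in plain Rust with no external crates. It is not the code from the parquet crate: epoch_nanos is a hypothetical helper, the day counts are computed by hand, and the timestamps are treated as UTC for simplicity. It only shows that a nanoseconds-since-epoch value for year 9999 does not fit in an i64, while the 2021 rows do.

```rust
/// Nanoseconds in one day.
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Hypothetical helper (not part of the parquet crate): converts a day count
/// relative to 1970-01-01 plus the nanoseconds within that day into
/// nanoseconds since the Unix epoch. Returns None when the result does not
/// fit in an i64 instead of panicking.
fn epoch_nanos(days_since_epoch: i64, nanos_of_day: i64) -> Option<i64> {
    days_since_epoch
        .checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos_of_day)
}

fn main() {
    // 2021-06-09 03:02:37 is 18_787 days after the epoch, plus 10_957 seconds.
    let recent = epoch_nanos(18_787, 10_957 * 1_000_000_000);
    // 9999-12-31 02:00:00 is 2_932_896 days after the epoch, plus 2 hours.
    let far_future = epoch_nanos(2_932_896, 2 * 3_600 * 1_000_000_000);

    println!("2021-06-09 03:02:37 -> {:?}", recent);
    println!("9999-12-31 02:00:00 -> {:?}", far_future);
}
```

With checked arithmetic the far-future row yields None rather than a panic. A reader could surface that as an error or a null, or use a coarser timestamp unit; which of these is appropriate is a design decision for the library and is not settled here.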