Reading Parquet file with timestamp column with 9999 year results in overflow panic #982

@andrei-ionescu

Describe the bug
Reading a Parquet file whose timestamp column contains a far-future date such as 9999-12-31 02:00:00 results in an overflow panic with the following output:

thread 'tokio-runtime-worker' panicked at 'attempt to multiply with overflow'

To Reproduce
Steps to reproduce the behavior:

  1. Download the attached zip file that contains the parquet file: data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet.zip
  2. Unzip it; this should give you the data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet file.
  3. Create a new project with cargo new read-parquet, create a data folder in the project root, and put the parquet file in that folder.
  4. Modify the Cargo.toml file to contain the following:
[package]
name = "read-parquet"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1.14", features = ["macros", "rt-multi-thread"] }
arrow = "6.0"
datafusion = "6.0"
  5. Put the following code in main.rs to read the given parquet file:
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new(); 
    /* 
     * Parquet file schema:
     *
     * message spark_schema {
     *   optional binary licence_code (UTF8);
     *   optional binary vehicle_make (UTF8);
     *   optional binary fuel_type (UTF8);
     *   optional int96 dimension_load_date;
     * }
     */
    ctx
        .register_parquet("vehicles", "./data/data-dimension-vehicle-20210609T222533Z-4cols-14rows.parquet")
        .await?;
    let df = ctx
        .sql("
            SELECT
                licence_code,
                vehicle_make,
                fuel_type,
                CAST(dimension_load_date as TIMESTAMP) as dms
            FROM vehicles
            LIMIT 10
        ")
        .await?;

    df
        .show()
        .await?;

    Ok(())
}
  6. Execute cargo run.
  7. Result:
thread 'tokio-runtime-worker' panicked at 'attempt to multiply with overflow', /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:46
stack backtrace:
   0: rust_begin_unwind
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:498:5
   1: core::panicking::panic_fmt
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:107:14
   2: core::panicking::panic
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panicking.rs:48:5
   3: <parquet::arrow::converter::Int96ArrayConverter as parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert::{{closure}}::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:46
   4: core::option::Option<T>::map
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/option.rs:846:29
   5: <parquet::arrow::converter::Int96ArrayConverter as parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:179:30
   6: core::iter::adapters::map::map_fold::{{closure}}
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:84:28
   7: core::iter::traits::iterator::Iterator::fold
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:2171:21
   8: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:124:9
   9: core::iter::traits::iterator::Iterator::for_each
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:737:9
  10: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<T,I>>::spec_extend
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/spec_extend.rs:40:17
  11: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/spec_from_iter_nested.rs:56:9
  12: alloc::vec::source_iter_marker::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/source_iter_marker.rs:31:20
  13: <alloc::vec::Vec<T> as core::iter::traits::collect::FromIterator<T>>::from_iter
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/alloc/src/vec/mod.rs:2549:9
  14: core::iter::traits::iterator::Iterator::collect
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:1745:9
  15: <parquet::arrow::converter::Int96ArrayConverter as parquet::arrow::converter::Converter<alloc::vec::Vec<core::option::Option<parquet::data_type::Int96>>,arrow::array::array_primitive::PrimitiveArray<arrow::datatypes::types::TimestampNanosecondType>>>::convert
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:177:13
  16: <parquet::arrow::converter::ArrayRefConverter<S,A,C> as parquet::arrow::converter::Converter<S,alloc::sync::Arc<dyn arrow::array::array::Array>>>::convert
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/converter.rs:450:9
  17: <parquet::arrow::array_reader::ComplexObjectArrayReader<T,C> as parquet::arrow::array_reader::ArrayReader>::next_batch
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:545:25
  18: <parquet::arrow::array_reader::StructArrayReader as parquet::arrow::array_reader::ArrayReader>::next_batch::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:1130:27
  19: core::iter::adapters::map::map_try_fold::{{closure}}
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:91:28
  20: core::iter::traits::iterator::Iterator::try_fold
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/traits/iterator.rs:1995:21
  21: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/iter/adapters/map.rs:117:9
  22: <parquet::arrow::array_reader::StructArrayReader as parquet::arrow::array_reader::ArrayReader>::next_batch
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/array_reader.rs:1127:30
  23: <parquet::arrow::arrow_reader::ParquetRecordBatchReader as core::iter::traits::iterator::Iterator>::next
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-6.2.0/src/arrow/arrow_reader.rs:175:15
  24: datafusion::physical_plan::file_format::parquet::read_partition
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/file_format/parquet.rs:424:19
  25: <datafusion::physical_plan::file_format::parquet::ParquetExec as datafusion::physical_plan::ExecutionPlan>::execute::{{closure}}::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/file_format/parquet.rs:213:29
  26: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/task.rs:42:21
  27: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/core.rs:161:17
  28: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/loom/std/unsafe_cell.rs:14:9
  29: tokio::runtime::task::core::CoreStage<T>::poll
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/core.rs:151:13
  30: tokio::runtime::task::harness::poll_future::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:461:19
  31: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/core/src/panic/unwind_safe.rs:271:9
  32: std::panicking::try::do_call
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:406:40
  33: <unknown>
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/datafusion-6.0.0/src/physical_plan/distinct_expressions.rs:127:15
  34: std::panicking::try
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panicking.rs:370:19
  35: std::panic::catch_unwind
             at /rustc/65c55bf931a55e6b1e5ed14ad8623814a7386424/library/std/src/panic.rs:133:14
  36: tokio::runtime::task::harness::poll_future
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:449:18
  37: tokio::runtime::task::harness::Harness<T,S>::poll_inner
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:98:27
  38: tokio::runtime::task::harness::Harness<T,S>::poll
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/harness.rs:53:15
  39: tokio::runtime::task::raw::poll
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/raw.rs:113:5
  40: tokio::runtime::task::raw::RawTask::poll
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/raw.rs:70:18
  41: tokio::runtime::task::UnownedTask<S>::run
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/task/mod.rs:379:9
  42: tokio::runtime::blocking::pool::Inner::run
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/pool.rs:264:17
  43: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
             at /Users/xxxx/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.14.0/src/runtime/blocking/pool.rs:244:17

Expected behavior
The parquet file should be read without a panic. The same file can be read with the parquet-tools CLI and with Apache Spark.

Additional context
The root cause is that the parquet file contains rows with far-future values such as 9999-12-31 02:00:00 in the dimension_load_date column. Such dates are supported by Parquet and Spark.

The content of the parquet file is:

+------------+------------------+------------------+-------------------+
|licence_code|vehicle_make      |fuel_type         |dimension_load_date|
+------------+------------------+------------------+-------------------+
|odc-odbl    |**Not Provided**  |**Not Provided**  |9999-12-31 02:00:00|
|odc-odbl    |**Not Applicable**|**Not Applicable**|9998-12-31 02:00:00|
|odc-odbl    |SAVIEM            |Petrol            |2021-06-09 03:02:37|
|odc-odbl    |YAMAHA            |Petrol            |2021-06-09 03:43:47|
|odc-odbl    |VAUXHALL          |Petrol            |2020-10-18 03:23:47|
|odc-odbl    |VAUXHALL          |Petrol            |2021-06-09 03:02:37|
|odc-odbl    |BMW               |Petrol            |2021-06-09 03:38:39|
|odc-odbl    |MG                |Petrol            |2020-10-18 03:23:47|
|odc-odbl    |PEUGEOT           |Diesel            |2020-10-18 03:35:16|
|odc-odbl    |FORD              |Diesel            |2020-10-18 03:23:47|
|odc-odbl    |FORD              |Petrol            |2020-10-18 03:12:55|
|odc-odbl    |SKODA             |Diesel            |2021-06-09 03:02:37|
|odc-odbl    |SHOGUN            |Diesel            |2020-10-18 03:12:55|
|odc-odbl    |MITSUBISHI        |Diesel            |2021-06-10 01:15:47|
+------------+------------------+------------------+-------------------+
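
To illustrate why the far-future rows trip the multiplication, here is a minimal, self-contained sketch of the arithmetic. It is not the parquet crate's code. It assumes that an INT96 timestamp carries a Julian day plus nanoseconds within that day, and that the converter first derives milliseconds since the Unix epoch and then multiplies by 1,000,000 to get the nanoseconds stored in a TimestampNanosecond array. The helper names int96_to_millis and millis_to_nanos_checked are hypothetical, and the values from the table above are taken at face value without any time zone adjustment.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const MILLIS_PER_DAY: i64 = 86_400_000;
const NANOS_PER_MILLI: i64 = 1_000_000;

/// Hypothetical helper: combine the two INT96 components (Julian day and
/// nanoseconds within the day) into milliseconds since the Unix epoch.
fn int96_to_millis(julian_day: i64, nanos_of_day: i64) -> i64 {
    (julian_day - JULIAN_DAY_OF_EPOCH) * MILLIS_PER_DAY + nanos_of_day / NANOS_PER_MILLI
}

/// Hypothetical helper: scale milliseconds to nanoseconds, returning None
/// instead of panicking when the result does not fit in an i64.
fn millis_to_nanos_checked(millis: i64) -> Option<i64> {
    millis.checked_mul(NANOS_PER_MILLI)
}

fn main() {
    // 2021-06-09 03:02:37 -> Julian day 2_459_375, 10_957 seconds into the day.
    let recent = int96_to_millis(2_459_375, 10_957 * 1_000_000_000);
    // 9999-12-31 02:00:00 -> Julian day 5_373_484, 7_200 seconds into the day.
    let far_future = int96_to_millis(5_373_484, 7_200 * 1_000_000_000);

    // The recent timestamp fits comfortably in 64-bit nanoseconds.
    println!("recent:     {:?}", millis_to_nanos_checked(recent));
    // The far-future timestamp needs about 2.5e20 ns, beyond i64::MAX (about 9.2e18).
    println!("far future: {:?}", millis_to_nanos_checked(far_future));
    // Largest millisecond value that can still be scaled to nanoseconds.
    println!("max millis: {}", i64::MAX / NANOS_PER_MILLI);
}
```

With a plain `*` in a debug build, the second conversion would panic with "attempt to multiply with overflow", which matches the message in the backtrace. Signed 64-bit nanoseconds since the epoch only reach into the year 2262, so neither the 9998 nor the 9999 row can be represented at that precision.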

For details on how the root cause was identified, see apache/datafusion#1359.
