[SPARK-18877][SQL][BACKPORT-2.1] `CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar` by dongjoon-hyun · Pull Request #16463 · apache/spark

dongjoon-hyun · 2017-01-03T21:08:41Z

What changes were proposed in this pull request?

CSV type inference throws IllegalArgumentException on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a partition. Specifically, inferRowType, the seqOp of aggregate, returns the last decimal type it sees instead of combining it with typeSoFar. This PR fixes it by using findTightestCommonType.

decimal.csv

9.03E+12
1.19E+11

BEFORE

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(3,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
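
The "precision 4 exceeds max precision 3" message can be reproduced with plain `java.math.BigDecimal`, outside Spark. The following is a minimal sketch, not code from the patch: the object and value names are illustrative, and it assumes the failure arises when the first value is rescaled to the inferred scale of -9.

```scala
import java.math.BigDecimal

// Illustrative only: DecimalFitCheck is a hypothetical name, not part of Spark.
object DecimalFitCheck {
  // The two sample values from decimal.csv, parsed exactly as written.
  val first  = new BigDecimal("9.03E+12") // unscaled 903, scale -10
  val second = new BigDecimal("1.19E+11") // unscaled 119, scale -9

  // Moving from scale -10 to scale -9 is exact: the unscaled value becomes 9030.
  val firstAtScaleMinus9: BigDecimal = first.setScale(-9)

  def main(args: Array[String]): Unit = {
    println(s"first:    precision=${first.precision()}, scale=${first.scale()}")
    println(s"second:   precision=${second.precision()}, scale=${second.scale()}")
    println(s"rescaled: precision=${firstAtScaleMinus9.precision()}, scale=${firstAtScaleMinus9.scale()}")
    println(firstAtScaleMinus9) // 9.030E+12
  }
}
```

Both values have precision 3 on their own, but 9.03E+12 needs four digits once it is expressed at scale -9, so decimal(3,-9) cannot hold it.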

AFTER

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(4,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
|      _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
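
The change from decimal(3,-9) to decimal(4,-9) can be worked through by hand. The sketch below is an assumption-laden illustration rather than Spark's implementation: `Dec` is a hypothetical stand-in for `DecimalType`, and `widen` assumes the common type keeps the larger scale and the larger number of integer digits. It ignores any upper limit on precision.

```scala
// Illustrative only: none of these names come from the Spark code base.
object DecimalWideningSketch {
  // Hypothetical stand-in for a decimal type; only precision and scale matter here.
  final case class Dec(precision: Int, scale: Int)

  // Assumed widening rule: keep the larger scale and the larger count of integer digits.
  def widen(a: Dec, b: Dec): Dec = {
    val scale     = math.max(a.scale, b.scale)
    val intDigits = math.max(a.precision - a.scale, b.precision - b.scale)
    Dec(intDigits + scale, scale)
  }

  // Per-value types in file order: 9.03E+12 -> (3,-10), 1.19E+11 -> (3,-9).
  val perValue: Seq[Dec] = Seq(Dec(3, -10), Dec(3, -9))

  // Behaviour described in the PR before the fix: the last decimal type wins.
  val lastWins: Dec = perValue.reduceLeft((_, next) => next)

  // Behaviour after the fix: each new type is merged into the running type.
  val merged: Dec = perValue.reduceLeft(widen)

  def main(args: Array[String]): Unit = {
    println(s"last wins: $lastWins") // Dec(3,-9)
    println(s"merged:    $merged")   // Dec(4,-9)
  }
}
```

Under these assumptions the integer-digit counts are 13 and 12, the larger scale is -9, and 13 + (-9) gives precision 4, which matches the schema shown in the AFTER output.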

How was this patch tested?

Pass the newly added test case.

…alType should find a common type with `typeSoFar`

CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.

**decimal.csv**
```
9.03E+12
1.19E+11
```

**BEFORE**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(3,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
```

**AFTER**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(4,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
|      _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
```

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16320 from dongjoon-hyun/SPARK-18877.

SparkQA · 2017-01-03T23:31:49Z

Test build #70829 has finished for PR 16463 at commit 7d4f31e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2017-01-04T00:08:14Z

Hi, @gatorsmile .
This is a backport of #16320 .
If backporting against branch-2.0 is possible, please let me know too.
Thank you!

dongjoon-hyun · 2017-01-04T16:32:20Z

Thank you for the review, @jaceklaskowski .

…lType should find a common type with `typeSoFar`

## What changes were proposed in this pull request?

CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.

**decimal.csv**
```
9.03E+12
1.19E+11
```

**BEFORE**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(3,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
```

**AFTER**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(4,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
|      _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
```

## How was this patch tested?

Pass the newly added test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16463 from dongjoon-hyun/SPARK-18877-BACKPORT-21.

gatorsmile · 2017-01-04T17:57:38Z

LGTM. Thanks, merging to 2.1!

gatorsmile · 2017-01-04T17:58:12Z

Could you please close it and open one for branch 2.0? Thanks!

dongjoon-hyun · 2017-01-04T18:05:29Z

Sure! Thank you, @gatorsmile .

jaceklaskowski approved these changes Jan 4, 2017

dongjoon-hyun closed this Jan 4, 2017

dongjoon-hyun deleted the SPARK-18877-BACKPORT-21 branch January 6, 2017 18:18

dongjoon-hyun wants to merge 1 commit into apache:branch-2.1 from dongjoon-hyun:SPARK-18877-BACKPORT-21

4 participants
