[SPARK-30788][SQL] Support `SimpleDateFormat` and `FastDateFormat` as legacy date/timestamp formatters by MaxGekk · Pull Request #27524 · apache/spark

MaxGekk · 2020-02-10T16:11:00Z

What changes were proposed in this pull request?

In the PR, I propose to add legacy date/timestamp formatters based on SimpleDateFormat and FastDateFormat:

LegacyFastTimestampFormatter - uses FastDateFormat and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see [SPARK-29904][SQL][2.4] Parse timestamps in microsecond precision by JSON/CSV datasources #26507 & [SPARK-29949][SQL][2.4] Fix formatting of timestamps by JSON/CSV datasources #26582
LegacySimpleTimestampFormatter - uses SimpleDateFormat and supports the lenient mode. When the lenient parameter is set to false, the parser becomes much stricter in checking its input.
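To illustrate what the lenient flag changes, here is a minimal sketch that uses only the JDK's `java.text.SimpleDateFormat`. The helpers `newFormat` and `roundTrip` are hypothetical names made up for this example and are not part of the PR.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}
import scala.util.Try

// Hypothetical helper: build a SimpleDateFormat with an explicit lenient flag.
def newFormat(pattern: String, lenient: Boolean): SimpleDateFormat = {
  val sdf = new SimpleDateFormat(pattern, Locale.US)
  sdf.setTimeZone(TimeZone.getTimeZone("UTC"))
  sdf.setLenient(lenient)
  sdf
}

// Parse the input and format it back with the same pattern.
// A Failure carries the ParseException thrown by the parser.
def roundTrip(input: String, lenient: Boolean): Try[String] = {
  val sdf = newFormat("yyyy-MM-dd", lenient)
  Try(sdf.format(sdf.parse(input)))
}

// February 30th does not exist: lenient mode rolls it over into March,
// while lenient = false rejects the input.
val lenientResult = roundTrip("2020-02-30", lenient = true)
val strictResult = roundTrip("2020-02-30", lenient = false)
```

With lenient parsing the out-of-range day is silently normalized; with `lenient = false` the same input is reported as an error.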

Why are the changes needed?

Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:

FastDateFormat - is used in the CSV/JSON datasources.
SimpleDateFormat - is used in the JDBC datasource and in partition parsing.
SimpleDateFormat in strict mode (lenient = false), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the date_format, from_unixtime, unix_timestamp and to_unix_timestamp functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all of these cases when spark.sql.legacy.timeParser.enabled is set to true.
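The flag-based fallback can be sketched as a small dispatch function. This is not the PR's code: `ParserConf` and `parseToIsoDate` are hypothetical names, the Boolean field merely stands in for spark.sql.legacy.timeParser.enabled, and the use of `java.time.format.DateTimeFormatter` on the non-legacy path is an assumption that this thread does not state.

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.{Locale, TimeZone}
import scala.util.Try

// Hypothetical stand-in for the value of spark.sql.legacy.timeParser.enabled.
final case class ParserConf(legacyTimeParserEnabled: Boolean)

// Parse a date string and return it in ISO form (yyyy-MM-dd).
// Legacy path: java.text.SimpleDateFormat.
// Non-legacy path (assumed): java.time.format.DateTimeFormatter.
def parseToIsoDate(s: String, pattern: String, conf: ParserConf): Try[String] =
  if (conf.legacyTimeParserEnabled) {
    Try {
      val utc = TimeZone.getTimeZone("UTC")
      val parser = new SimpleDateFormat(pattern, Locale.US)
      parser.setTimeZone(utc)
      val iso = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
      iso.setTimeZone(utc)
      iso.format(parser.parse(s))
    }
  } else {
    Try(LocalDate.parse(s, DateTimeFormatter.ofPattern(pattern, Locale.US)).toString)
  }
```

For example, the input `2020-2-3` with the pattern `yyyy-MM-dd` is accepted on the legacy path, because `SimpleDateFormat` reads `MM` as a number of any width, but is rejected by `DateTimeFormatter`, which expects exactly two digits. Differences of this kind are what the legacy flag is meant to paper over.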

Does this PR introduce any user-facing change?

This shouldn't change behavior with default settings. If spark.sql.legacy.timeParser.enabled is set to true, users should observe the behavior of Spark 2.4.

How was this patch tested?

Modified tests in DateExpressionsSuite to check the legacy parser - SimpleDateFormat.
Added CSVLegacyTimeParserSuite and JsonLegacyTimeParserSuite to run CSVSuite and JsonSuite with the legacy parser - FastDateFormat.

MaxGekk · 2020-02-10T16:16:34Z

+          df.select(unix_timestamp(col("ss")).cast("timestamp")))
+        checkAnswer(df.select(to_timestamp(col("ss"))), Seq(
+          Row(ts1), Row(ts2)))
+        if (legacyParser) {


I had to handle legacy mode specially because of the behavior change of to_timestamp.

Unfortunately, SimpleDateFormat doesn't work correctly with the pattern .S. In Spark 2.4, this wasn't visible in the test because to_timestamp truncated results to seconds.

@cloud-fan Only here did I have to modify the test to adapt it for the legacy parser.
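One well-known quirk of `SimpleDateFormat` with a fractional-second pattern can be reproduced with the JDK alone. Whether it is exactly the problem hit in this test is an assumption. The letter `S` is read as a plain millisecond count, not as a decimal fraction of a second. `fractionMillis` is a hypothetical helper name.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

// Hypothetical helper: parse with the ".S" pattern and return only the
// millisecond part of the resulting instant.
def fractionMillis(input: String): Long = {
  val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.S", Locale.US)
  sdf.setTimeZone(TimeZone.getTimeZone("UTC"))
  sdf.parse(input).getTime % 1000L
}

// ".5" is interpreted as 5 milliseconds, not as half a second.
val halfSecond = fractionMillis("2015-07-24 10:00:00.5")
```

A reader expecting `.5` to mean 500 ms gets 5 ms instead. Once to_timestamp stops truncating to seconds, discrepancies like this become visible in test results.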

SparkQA · 2020-02-10T20:52:26Z

Test build #118172 has finished for PR 27524 at commit 38d90d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait LegacyTimestampFormatter extends TimestampFormatter
class LegacyFastDateFormatter(
class LegacySimpleDateFormatter(

SparkQA · 2020-02-10T21:46:17Z

Test build #118175 has finished for PR 27524 at commit ab1d57f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait LegacyDateFormatter extends DateFormatter
class LegacyFastDateFormatter(pattern: String, locale: Locale) extends LegacyDateFormatter
class LegacySimpleDateFormatter(pattern: String, locale: Locale) extends LegacyDateFormatter
class LegacyFastTimestampFormatter(
class LegacySimpleTimestampFormatter(

MaxGekk · 2020-02-10T22:28:27Z

@cloud-fan @HyukjinKwon Please, look at the draft PR.

MaxGekk · 2020-02-10T22:30:39Z

@@ -1,2 +1,2 @@
 "good record",1999-08-01
-"bad record",1999-088-01
+"bad record",1999-088_01


I had to change this because FastDateFormat is not as strict, and can parse 1999-088-01

Do we run these tests with the legacy formatter?

Yes, I added CSVLegacyTimeParserSuite which runs entire CSVSuite with the legacy parser.
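The reason a three-digit month no longer makes a record "bad" can be sketched with the JDK's `SimpleDateFormat` in lenient mode. This is only a stand-in: it assumes that `FastDateFormat` treats an oversized numeric field in a similar way, which the thread implies but does not show. `parses` is a hypothetical helper name.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}
import scala.util.Try

// Hypothetical helper: report whether the input parses under yyyy-MM-dd.
def parses(input: String, lenient: Boolean): Boolean = {
  val sdf = new SimpleDateFormat("yyyy-MM-dd", Locale.US)
  sdf.setTimeZone(TimeZone.getTimeZone("UTC"))
  sdf.setLenient(lenient)
  Try(sdf.parse(input)).isSuccess
}

// Month "088" is read as the number 88 and rolled over in lenient mode.
val lenientAcceptsOverflow = parses("1999-088-01", lenient = true)
// A strict parser rejects the out-of-range month.
val strictAcceptsOverflow = parses("1999-088-01", lenient = false)
// A wrong separator fails even in lenient mode, so it still marks a bad record.
val lenientAcceptsBadSeparator = parses("1999-088_01", lenient = true)
```

Replacing the separator with `_` yields an input that a lenient parser cannot accept, which is why the test data was changed in this way.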

MaxGekk · 2020-02-10T22:31:11Z

@@ -1,2 +1,2 @@
-0,2013-111-11 12:13:14
+0,2013-111_11 12:13:14


2013-111-11 is valid for FastDateFormat

MaxGekk · 2020-02-10T22:32:53Z

+ * Also this class allows to set raw value to the `MILLISECOND` field
+ * directly before formatting.
+ */
+class MicrosCalendar(tz: TimeZone, digitsInFraction: Int)


This is a copy-paste from 2.4

HyukjinKwon · 2020-02-11T02:34:40Z

Approach seems okay.

SparkQA · 2020-02-11T03:18:37Z

Test build #118185 has finished for PR 27524 at commit 566170a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-02-11T07:23:51Z

Test build #118213 has finished for PR 27524 at commit ddf127d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-02-11T07:38:37Z

  val MAX_LONG_DIGITS = 18

-  private val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)
+  val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)


POW_10 is needed in the wrapper of FastDateFormat to support parsing/formatting in microsecond precision. Similar changes were made in Spark 2.4.
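As a rough illustration of why a power-of-ten table helps here, the sketch below scales the digits parsed after the decimal point to microseconds. It is not the Spark implementation: `fractionToMicros` and `MICROS_PER_SECOND` are names chosen for this example, and only the `POW_10` definition is taken from the diff above.

```scala
val MAX_LONG_DIGITS = 18

// Same table as in the diff: POW_10(i) == 10^i for i in 0..18.
val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong)

val MICROS_PER_SECOND = 1000000L

// Hypothetical helper: `rawFraction` holds the digits after the decimal point
// as a number, and `digitsInFraction` says how many digits were parsed.
// Example: ".123" gives rawFraction = 123, digitsInFraction = 3, 123000 micros.
def fractionToMicros(rawFraction: Long, digitsInFraction: Int): Long = {
  require(digitsInFraction >= 1 && digitsInFraction <= 6, "expected 1 to 6 fraction digits")
  rawFraction * MICROS_PER_SECOND / POW_10(digitsInFraction)
}
```

For instance, `fractionToMicros(5, 1)` treats `.5` as half a second (500000 microseconds), and `fractionToMicros(123456, 6)` keeps all six digits.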

SparkQA · 2020-02-11T08:05:01Z

Test build #118215 has finished for PR 27524 at commit 93f3ae1.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-02-11T08:07:08Z

jenkins, retest this, please

cloud-fan · 2020-02-11T09:20:26Z

+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
 import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String}



Are none of the tests in this file affected by the choice of new or legacy formatter?

I wrapped the tests that are affected by:

Seq(false, true).foreach { legacyParser =>
  withSQLConf(SQLConf.LEGACY_TIME_PARSER_ENABLED.key -> legacyParser.toString) {
  }
}

They work fine with SimpleDateFormat and lenient = false.

SparkQA · 2020-02-11T10:28:35Z

Test build #118219 has finished for PR 27524 at commit 93f3ae1.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-02-11T10:37:23Z

jenkins, retest this, please

SparkQA · 2020-02-11T16:25:07Z

Test build #118232 has finished for PR 27524 at commit 93f3ae1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-02-12T12:12:51Z

thanks, merging to master/3.0!

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c198620)

Support SimpleDateFormat as a legacy parser

38d90d8

Support SimpleDateFormat as a legacy date parser

ab1d57f

dongjoon-hyun added the SQL label Feb 10, 2020

MaxGekk added 3 commits February 11, 2020 01:18

Add CSVLegacyTimeParserSuite

d8ddc20

Add JsonLegacyTimeParserSuite

d993d1a

Bug fix: set correct locale

566170a

xuanyuanking mentioned this pull request Feb 11, 2020

[SPARK-31410][SPARK-30668][SQL][FOLLOWUP] Raise exception instead of silent change for new DateFormatter #27537

Closed

MaxGekk added 4 commits February 11, 2020 09:32

Add override

ff52d49

Add empty line at the end of JsonSuite

b14ee41

Remove withStrongLegacy

ddf127d

Add override

ca73925

MaxGekk added 2 commits February 11, 2020 10:29

Remove unused class LegacyFastDateFormat

11394cd

Make Scala code style checker happy

93f3ae1

MaxGekk changed the title ~~[WIP][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters~~ [SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters Feb 11, 2020

cloud-fan closed this in c198620 Feb 12, 2020

MaxGekk deleted the timestamp-formatter-legacy-fallback branch June 5, 2020 19:43
