[SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters #27524
MaxGekk wants to merge 11 commits into
| df.select(unix_timestamp(col("ss")).cast("timestamp"))) | ||
| checkAnswer(df.select(to_timestamp(col("ss"))), Seq( | ||
| Row(ts1), Row(ts2))) | ||
| if (legacyParser) { |
I had to handle legacy mode specially because of the behavior change in to_timestamp.
Unfortunately, SimpleDateFormat doesn't work correctly with the pattern .S. In Spark 2.4, it wasn't visible in the test because to_timestamp truncated results to seconds.
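For reference, the `.S` issue can be reproduced outside Spark. The following is a minimal sketch in plain Java (since `SimpleDateFormat` is a JDK class); the class and method names are made up for illustration and are not part of this PR. It shows that `S` is read as a millisecond count, so `.5` becomes 5 ms rather than half a second, and six fraction digits shift the result by about two minutes in the default lenient mode.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not code from the PR.
final class FractionPatternDemo {
    // Parses `text` with `pattern` in UTC and returns epoch milliseconds.
    static long parseMillis(String pattern, String text) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern, Locale.US);
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return sdf.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Cannot parse: " + text, e);
        }
    }

    // Returns how many milliseconds the fractional part added to the
    // whole-second timestamp formed by the first 19 characters of `text`.
    static long fractionOffsetMillis(String text) {
        long withFraction = parseMillis("yyyy-MM-dd HH:mm:ss.S", text);
        long wholeSeconds = parseMillis("yyyy-MM-dd HH:mm:ss", text.substring(0, 19));
        return withFraction - wholeSeconds;
    }

    public static void main(String[] args) {
        // ".5" is taken as 5 milliseconds, not 500.
        System.out.println(fractionOffsetMillis("2015-07-22 10:00:00.5"));      // 5
        // ".123456" is taken as 123456 milliseconds (about 2 minutes 3 seconds).
        System.out.println(fractionOffsetMillis("2015-07-22 10:00:00.123456")); // 123456
    }
}
```

When the result is truncated to seconds, the first case hides the problem entirely, which matches the observation about Spark 2.4 above.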
@cloud-fan Only here did I have to modify the test to adapt it for the legacy parser.
Test build #118172 has finished for PR 27524 at commit
Test build #118175 has finished for PR 27524 at commit
@cloud-fan @HyukjinKwon Please take a look at the draft PR.
| @@ -1,2 +1,2 @@ | |||
| "good record",1999-08-01 | |||
| -"bad record",1999-088-01 | |||
| +"bad record",1999-088_01 | |||
I had to change this because FastDateFormat is not so strict and can parse 1999-088-01.
Do we run these tests with the legacy formatter?
Yes, I added CSVLegacyTimeParserSuite, which runs the entire CSVSuite with the legacy parser.
| @@ -1,2 +1,2 @@ | |||
| -0,2013-111-11 12:13:14 | |||
| +0,2013-111_11 12:13:14 | |||
2013-111-11 is valid for FastDateFormat
| * Also this class allows to set raw value to the `MILLISECOND` field | ||
| * directly before formatting. | ||
| */ | ||
| class MicrosCalendar(tz: TimeZone, digitsInFraction: Int) |
This is a copy-paste from 2.4
Approach seems okay.
Test build #118185 has finished for PR 27524 at commit
Test build #118213 has finished for PR 27524 at commit
| val MAX_LONG_DIGITS = 18 | ||
| - private val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong) | ||
| + val POW_10 = Array.tabulate[Long](MAX_LONG_DIGITS + 1)(i => math.pow(10, i).toLong) |
POW_10 is needed in the wrapper of FastDateFormat to support parsing/formatting in microsecond precision. Similar changes were made in Spark 2.4.
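To illustrate why a power-of-ten table is handy here, below is a rough sketch in plain Java. It assumes the parser hands back the fraction digits as one raw integer together with the number of digits that were matched. The class `FractionToMicros` and the method `fractionToMicros` are hypothetical names used only for this example; this is not the code added in the PR.

```java
// Hypothetical helper, not code from the PR.
final class FractionToMicros {
    static final int MAX_LONG_DIGITS = 18;
    static final int MICROS_DIGITS = 6;

    // POW_10[i] == 10^i, built with exact integer arithmetic.
    static final long[] POW_10 = new long[MAX_LONG_DIGITS + 1];
    static {
        POW_10[0] = 1L;
        for (int i = 1; i <= MAX_LONG_DIGITS; i++) {
            POW_10[i] = POW_10[i - 1] * 10L;
        }
    }

    // Converts the raw fraction digits (e.g. 123 parsed from ".123")
    // to microseconds, given how many digits the fraction had.
    static long fractionToMicros(long rawFraction, int digitsInFraction) {
        if (digitsInFraction < 0 || digitsInFraction > MAX_LONG_DIGITS) {
            throw new IllegalArgumentException("Unsupported digit count: " + digitsInFraction);
        }
        if (digitsInFraction <= MICROS_DIGITS) {
            // Fewer than six digits: scale up, ".5" -> 500000 micros.
            return rawFraction * POW_10[MICROS_DIGITS - digitsInFraction];
        }
        // More than six digits: drop the digits below microsecond precision.
        return rawFraction / POW_10[digitsInFraction - MICROS_DIGITS];
    }

    public static void main(String[] args) {
        System.out.println(fractionToMicros(123L, 3));       // 123000
        System.out.println(fractionToMicros(123456L, 6));    // 123456
        System.out.println(fractionToMicros(123456789L, 9)); // 123456
    }
}
```

Splitting the two cases keeps the arithmetic within `long` range for any supported digit count.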
Test build #118215 has finished for PR 27524 at commit
jenkins, retest this, please
| import org.apache.spark.sql.internal.SQLConf | ||
| import org.apache.spark.sql.types._ | ||
| import org.apache.spark.unsafe.types.{CalendarInterval, UTF8String} | ||
Are none of the tests in this file affected by the choice between the new and the legacy formatter?
I wrapped the tests that are affected in:

    Seq(false, true).foreach { legacyParser =>
      withSQLConf(SQLConf.LEGACY_TIME_PARSER_ENABLED.key -> legacyParser.toString) {
      }
    }
They work fine with SimpleDateFormat and lenient = false.
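As a quick illustration of what `lenient = false` changes, here is a small sketch in plain Java. The class and method names are invented for this example and do not come from the PR. In lenient mode an out-of-range day rolls over into the next month, while the strict mode rejects the input.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not code from the PR.
final class LenientDemo {
    // Parses `text` as yyyy-MM-dd and formats the result back.
    // Returns null when the parser rejects the input.
    static String parseAndReformat(String text, boolean lenient) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd", Locale.US);
        sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
        sdf.setLenient(lenient);
        try {
            return sdf.format(sdf.parse(text));
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Lenient: February 30 silently becomes March 2 (2019 is not a leap year).
        System.out.println(parseAndReformat("2019-02-30", true));  // 2019-03-02
        // Strict: the same input is rejected.
        System.out.println(parseAndReformat("2019-02-30", false)); // null
        // Valid input parses the same way in both modes.
        System.out.println(parseAndReformat("2019-02-28", false)); // 2019-02-28
    }
}
```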
Test build #118219 has finished for PR 27524 at commit
jenkins, retest this, please
Test build #118232 has finished for PR 27524 at commit
thanks, merging to master/3.0!
… legacy date/timestamp formatters

### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see #26507 & #26582.
- `LegacySimpleTimestampFormatter` uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `FastDateFormat` in the CSV/JSON datasources.
- `SimpleDateFormat` in the JDBC datasource and in partition parsing.
- `SimpleDateFormat` in strict mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe the behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser, `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser, `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c198620)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>