[SPARK-17495] [SQL] Support Decimal type in Hive-hash#17056
[SPARK-17495] [SQL] Support Decimal type in Hive-hash#17056tejasapatil wants to merge 7 commits into
Conversation
|
ok to test |
|
Test build #73410 has started for PR 17056 at commit |
## What changes were proposed in this pull request? This PR adds tests hive-hash by comparing the outputs generated against Hive 1.2.1. Following datatypes are covered by this PR: - null - boolean - byte - short - int - long - float - double - string - array - map - struct Datatypes that I have _NOT_ covered but I will work on separately are: - Decimal (handled separately in #17056) - TimestampType - DateType - CalendarIntervalType ## How was this patch tested? NA Author: Tejas Patil <tejasp@fb.com> Closes #17049 from tejasapatil/SPARK-17495_remaining_types.
a378b3e to
8595305
Compare
|
retest this please |
|
Test build #73460 has finished for PR 17056 at commit
|
|
@cloud-fan @gatorsmile : can you please review this PR ? |
There was a problem hiding this comment.
A quick question: these expected values are got from Hive?
There was a problem hiding this comment.
I did a quick check. Most are right, but some of them do not match
There was a problem hiding this comment.
These were generated over Hive 1.2.1.
There was a problem hiding this comment.
hive> select HASH(CAST("-18446744073709001000" AS DECIMAL(38,19)));
OK
0
Time taken: 0.035 seconds, Fetched: 1 row(s)
There was a problem hiding this comment.
I figured out the problem was with the test case being not looking at the result of decimal.changePrecision. Fixed.
## What changes were proposed in this pull request? This PR adds tests hive-hash by comparing the outputs generated against Hive 1.2.1. Following datatypes are covered by this PR: - null - boolean - byte - short - int - long - float - double - string - array - map - struct Datatypes that I have _NOT_ covered but I will work on separately are: - Decimal (handled separately in apache#17056) - TimestampType - DateType - CalendarIntervalType ## How was this patch tested? NA Author: Tejas Patil <tejasp@fb.com> Closes apache#17049 from tejasapatil/SPARK-17495_remaining_types.
|
Jenkins test this please |
|
Test build #73595 has finished for PR 17056 at commit
|
|
Test build #73596 has finished for PR 17056 at commit
|
There was a problem hiding this comment.
where do we use ctx and d?
There was a problem hiding this comment.
They both aren't used but are a part of the method signature since the default impl in abstract class needs those :
There was a problem hiding this comment.
${classOf[HiveHashFunction].getName}.normalizeDecimal
There was a problem hiding this comment.
HiveHashFunction is an object so cannot do classOf[]. Tried HiveHashFunction.getClass.getName as per other places in the codebase
There was a problem hiding this comment.
hmm it's hard to guarantee that we can produce same hash value as hive, can we run hive in the test and compare the result with spark?
There was a problem hiding this comment.
The expected values are generated using hive 1.2.1. My original approach was to depend on Hive for generating expected values but as per discussion in a related PR, I was suggested to hardcode expected values. The main point being reduce dependency on Hive
|
Test build #73610 has finished for PR 17056 at commit
|
There was a problem hiding this comment.
Nit: if (input == null) return null
|
Let me manually check whether the results are consistent with Hive. |
428a9a4 to
c0c8390
Compare
|
@gatorsmile : I really appreciate your help in reviewing this PR to the extent that you are manually checking the hashes over Hive. If you haven't already embarked on that, here is the set of hive queries corresponding to the test case in the PR which you can easily copy paste: |
|
Thank you! I checked Hive 2.1. It has the exactly same hash values. |
|
LGTM. cc @cloud-fan for final signing-off |
|
Test build #73670 has finished for PR 17056 at commit
|
|
@cloud-fan ping !! |
| private val HiveDecimalMaxScale = 38 | ||
|
|
||
| // Mimics normalization done for decimals in Hive at HiveDecimalV1.normalize() | ||
| def normalizeDecimal(input: BigDecimal, allowRounding: Boolean): BigDecimal = { |
There was a problem hiding this comment.
allowRounding will never be false?
There was a problem hiding this comment.
removed that param
| HiveHasher.hashUnsafeBytes(base, offset, len) | ||
| } | ||
|
|
||
| private val HiveDecimalMaxPrecision = 38 |
There was a problem hiding this comment.
nit: HIVE_DECIMAL_MAX_PRECISION
|
Test build #73945 has finished for PR 17056 at commit
|
|
LGTM(if tests pass) |
|
Test build #74018 has finished for PR 17056 at commit
|
|
thanks, merging to master! |
What changes were proposed in this pull request?
Hive hash to support Decimal datatype. Hive internally normalises decimals and I have ported that logic as-is to HiveHash.
How was this patch tested?
Added unit tests