Closed
Description
Currently histogrammar is unable to classify a timestamp/date column of a Spark DataFrame as a timestamp; the column is classified as a number instead. Histogrammar does its binning calculations by converting timestamps into nanoseconds. When it converts the nanoseconds back to timestamps, the conversion fails for Spark DataFrames, because there the nanosecond values arrive as plain Python float or Spark data types, whereas for a pandas DataFrame they are stored as NumPy floating-point or integer scalars. The check below only recognizes NumPy numbers, so Spark values never pass it. This can be solved by fixing the following block of code in util.py:
def _is_probable_timestamp(value, DATE_LOW=5e16, DATE_HIGH=9.9e18):
    """function to check if input number is probably a timestamp in nanoseconds
    :param value: input value
    :return: True if timestamp
    """
    import numpy as np
    # HACK: making an educated guess for timestamp
    # large numbers (time in ns since 1970) used to determine if float corresponds to a timestamp
    # DATE_LOW = 5e16 = 1971-08-02 16:53:20 in nanosec
    # DATE_HIGH = 9.9e18 = 2260-1-1 in nanosec
    # timestamp is in ns since 1970, so a huge number.
    is_ts = False
    # should be -> if (isinstance(value, np.number) or isinstance(value, float)) and not np.isnan(value):
    if isinstance(value, np.number) and not np.isnan(value):
        is_ts = DATE_LOW < value < DATE_HIGH
    return is_ts
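For illustration, here is a minimal standalone sketch of the change suggested in the comment above. It is not the library's actual patch: the function name `is_probable_timestamp` and the lowercase parameter names are hypothetical, and it assumes the Spark path delivers plain Python `float` values, as described in this issue.

```python
import numpy as np


def is_probable_timestamp(value, date_low=5e16, date_high=9.9e18):
    """Guess whether a number is a timestamp in nanoseconds since 1970.

    Hypothetical variant of _is_probable_timestamp that also accepts
    plain Python floats (the assumed Spark case), not only NumPy scalars.
    """
    # Accept NumPy numbers (pandas path) and built-in floats (assumed Spark path).
    if not isinstance(value, (np.number, float)):
        return False
    # NaN can never be a timestamp.
    if np.isnan(value):
        return False
    # Same heuristic window as the original function.
    return bool(date_low < value < date_high)


print(is_probable_timestamp(1.6e18))              # plain Python float
print(is_probable_timestamp(np.float64(1.6e18)))  # NumPy scalar
print(is_probable_timestamp(42.0))                # small number, not a timestamp
```

With the original check, the first call would return False because a built-in `float` is not an instance of `np.number`; the other two calls behave the same in both versions.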
Because of this bug, the time axis in popmon report graphs shows raw nanosecond values instead of timestamps (mentioned in the issue here).