Wrong datatype inference of timestamp columns for Spark DataFrames #58

@pradyot-09

Currently, histogrammar is unable to classify a timestamp/date column of a Spark DataFrame as a timestamp; the column is classified as a number instead. histogrammar does its binning calculations by converting timestamps into nanoseconds. However, when the nanoseconds are converted back to timestamps, the check fails for Spark DataFrames because the nanosecond values are stored as plain float or Spark data types, whereas for pandas DataFrames they are stored as np.float or np.int. This can be solved by fixing the following block of code in util.py:

def _is_probable_timestamp(value, DATE_LOW=5e16, DATE_HIGH=9.9e18):
    """function to check if input number is probably a timestamp in nanoseconds

    :param value: input value
    :return: True if timestamp
    """
    import numpy as np
    # HACK: making an educated guess for timestamp
    # large numbers (time in ns since 1970) used to determine if float corresponds to a timestamp
    # DATE_LOW = 5e16     = 1971-08-02 16:53:20 in nanosec
    # DATE_HIGH = 9.9e18  = 2260-1-1 in nanosec

    # timestamp is in ns since 1970, so a huge number.
    is_ts = False
    # should be -> if (isinstance(value, np.number) or isinstance(value, float)) and not np.isnan(value):
    if isinstance(value, np.number) and not np.isnan(value): 
        is_ts = DATE_LOW < value < DATE_HIGH
    return is_ts
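
For illustration, here is a minimal sketch of what a type-agnostic version of this check could look like. It is not the library's actual patch. The function name `is_probable_timestamp` and the lowercase parameter names are hypothetical, and it goes slightly beyond the suggestion above by accepting any real number (Python int as well as float) through the standard-library `numbers.Real` ABC.

```python
import numbers


def is_probable_timestamp(value, date_low=5e16, date_high=9.9e18):
    """Guess whether a number is a timestamp in nanoseconds since 1970.

    Hypothetical variant of the check quoted above; the bounds are the
    same defaults as in the original block.
    """
    # bool is a subclass of int, so rule it out explicitly.
    if isinstance(value, bool):
        return False
    # numbers.Real covers plain Python int and float.
    if not isinstance(value, numbers.Real):
        return False
    # NaN compares False against everything, so no separate NaN test is needed.
    return date_low < value < date_high
```

NumPy registers its integer and floating scalar types with the `numbers` ABCs, so values such as `np.float64` should pass the same `isinstance` test without importing NumPy here. Spark-specific wrapper types are not covered by this sketch and would need to be converted to plain numbers first.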

Due to this bug, popmon report graphs show raw nanoseconds on the time axis (mentioned in the issue here).
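
To make the nanosecond values concrete, the sketch below converts nanoseconds since the Unix epoch back into a readable UTC timestamp using only the standard library. The helper name `ns_to_datetime` is hypothetical and is not part of histogrammar or popmon. It reproduces the date given for DATE_LOW in the code comment above.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns):
    """Convert nanoseconds since the Unix epoch to a UTC datetime."""
    # datetime has microsecond resolution, so sub-microsecond digits are dropped.
    return EPOCH + timedelta(microseconds=int(ns) // 1000)


print(ns_to_datetime(5e16))  # 1971-08-02 16:53:20+00:00
```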
