Wrong datatype inference of timestamp columns for Spark DataFrames

Currently histogrammer is unable to classify a timestamp/date column of a Spark DataFrame as a timestamp; the column is classified as a number instead. The histogrammer does its binning calculations by converting timestamps into nanoseconds. However, converting the nanoseconds back to timestamps fails for Spark DataFrames, because there the nanosecond values arrive as plain Python float or Spark datatypes, whereas for a Pandas DataFrame they are stored as NumPy numeric types (e.g. np.float64 or np.int64). This can be solved by fixing the following block of code in util.py:


```
def _is_probable_timestamp(value, DATE_LOW=5e16, DATE_HIGH=9.9e18):
    """function to check if input number is probably a timestamp in nanoseconds

    :param value: input value
    :return: True if timestamp
    """
    import numpy as np
    # HACK: making an educated guess for timestamp
    # large numbers (time in ns since 1970) used to determine if float corresponds to a timestamp
    # DATE_LOW = 5e16     = 1971-08-02 16:53:20 in nanosec
    # DATE_HIGH = 9.9e18  = 2260-1-1 in nanosec

    # timestamp is in ns since 1970, so a huge number.
    is_ts = False
    # should be -> if (isinstance(value, np.number) or isinstance(value, float)) and not np.isnan(value):
    if isinstance(value, np.number) and not np.isnan(value):
        is_ts = DATE_LOW < value < DATE_HIGH
    return is_ts
```
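To illustrate the suggested change, here is a minimal sketch of the check with plain Python floats accepted alongside NumPy scalars. The function name `is_probable_timestamp_sketch` and the lowercase parameter names are hypothetical and chosen only for this example; the thresholds are the ones from the block above. It is an illustration of the idea, not the actual patch.

```python
import numpy as np


def is_probable_timestamp_sketch(value, date_low=5e16, date_high=9.9e18):
    """Guess whether a number is a timestamp in nanoseconds since 1970.

    Hypothetical stand-in for _is_probable_timestamp with the proposed fix.
    """
    # Accept NumPy scalars (the pandas path) as well as built-in floats
    # (the type reported for Spark DataFrames in this issue).
    if isinstance(value, (np.number, float)) and not np.isnan(value):
        return bool(date_low < value < date_high)
    return False


ns = 1.6e18  # about September 2020, in nanoseconds since 1970

print(is_probable_timestamp_sketch(np.float64(ns)))  # True
print(is_probable_timestamp_sketch(ns))              # True (plain float)
print(is_probable_timestamp_sketch(float("nan")))    # False
print(is_probable_timestamp_sketch(42.0))            # False
```

With the original `isinstance(value, np.number)` condition, the second call would return `False`, which matches the misclassification described above.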

Due to this bug, popmon report graphs show raw nanosecond values on the time axis instead of timestamps (see the issue [here](https://github.com/ing-bank/popmon/issues/233)).
