Skip to content

Conversation

@JanBenisek
Copy link
Contributor

fixes #39

  • check the extensive description in the Issue
  • the main cause is sorting of array with nan, which is not determinitstic
  • therefore, I am replacing the nan with np.inf/-np.inf (see line 404-419)
  • additionaly, we need to remove duplicates when returning the bin edges, but we cannot use set
    • hence I used list(dict.fromkeys(bin_edges))
    • because set() is unordered by definition, so resulting list will have different order.
  • I was also formatting the code based on pycodestyle
  • there are already unit tests for behaviour of this module, so I only added one example with inf data

@JanBenisek JanBenisek added this to the v1.0.2 milestone Mar 17, 2021
@JanBenisek JanBenisek requested a review from sandervh14 March 17, 2021 16:23
@JanBenisek JanBenisek linked an issue Mar 17, 2021 that may be closed by this pull request
@JanBenisek
Copy link
Contributor Author

I made second commit where I added log.warning() if variables is 99% nan.

Copy link
Contributor

@sandervh14 sandervh14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Honza,

I carefully checked the info in the issue and the pull request + the code.
I think that there is a remaining problem, related to the nan->inf trick you apply on the bin edges:

        if math.isnan(bin_edges[0]):
            bin_edges[0] = -np.inf

        if math.isnan(bin_edges[-1]):
            bin_edges[-1] = np.inf

This would run into troubles if the dataframe column would initially already contain inf/-inf values.

Consider the following test case:

@pytest.mark.parametrize("n_bins, auto_adapt_bins, data, expected",
                             [
                              (2, False,
                               # Variable contains floats and inf (e.g. "available resources" variable): WORKS
                               pd.DataFrame({"variable": [5.4, 9.3, np.inf]}),
                               [(5.0, 10.0), (10.0, np.inf)])],
                             ids=["variable with floats and inf"])
    def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
        discretizer = KBinsDiscretizer(n_bins=n_bins,
                                       auto_adapt_bins=auto_adapt_bins)

        actual = discretizer._fit_column(data, column_name="variable")

        assert actual == expected

bin_edges = [5.4, nan, nan] at first, as computed by:

        if self.strategy == "quantile":
            bin_edges = list(data[column_name]
                             .quantile(np.linspace(0, 1, n_bins + 1),
                                       interpolation='linear'))

Note that the number of nans that would be outputted here depend on the chosen number of bins and on the spread of the float numbers that are present in the column. If both -inf and inf would be present in the column values, it even gets more complicated.

Anyway, then the nan->inf trick bin edge trick converts the above bin_edges to bin_edges = [5.4, nan, inf], since it only converts the first and last bin edge to (-)inf if they are nan.

I see two possible solutions - WDYT?

  • Keeping maximum 1 nan at both ends of the bin_edges array, so discarding all other nans, and then applying the nan->inf trick,
  • or warning the user about the presence of -inf and/or inf in the column and that the KBinsDiscretizer cannot properly treat this, that the user should impute infs in the column.

Note: I also found a minor bug while trying some stuff out for this pull request, but logged it as a separate issue, since it's only slightly related to the problem solved in this issue/PR.

@JanBenisek
Copy link
Contributor Author

Sander,
I did some testing and found interesting conclusions.

Your example with two bins:

data = pd.DataFrame({"variable": [5.4, 9.3, np.inf]})
discretizer = KBinsDiscretizer(n_bins=2, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# This gives indeed [(5.0, nan), (nan, inf)]

I added one extra row:

data = pd.DataFrame({"variable": [5.4, 9.3, 10.2, np.inf]})
discretizer = KBinsDiscretizer(n_bins=2, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# Gives [(5.0, 10.0), (10.0, inf)]

After some tinkering, I realized that if the number of floats minus number of np.inf is equal to 1 (for example we have 7 floats and 6 infs), it returns nan in both edges. Not good. This assumes that n_bins=2.

What if we change the n_bins=3 parameter in your original example?

data = pd.DataFrame({"variable": [5.4, 9.3, np.inf]})
discretizer = KBinsDiscretizer(n_bins=3, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# This returns two bins (makes sense, we got only two values), but no nans: [(5.0, 8.0), (8.0, inf)]

Let's further play with n_bins. Now we got 7 floats and 6 np.inf.

data = pd.DataFrame({"variable": [5.4, 9.3, 10.2, 11.5, 12.2, 13.3, 14.4,
                                  np.inf, np.inf, np.inf, np.inf, np.inf, np.inf]})
discretizer = KBinsDiscretizer(n_bins=2, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# With n_bins=2, we get [(5.0, nan), (nan, inf)]
# With n_bins=3, we get [(5.0, 12.0), (12.0, inf)]
# With n_bins=4, we get [(5.0, 12.0), (12.0, nan), (nan, nan), (nan, inf)]
# With n_bins=5, we get [(5.0, 11.0), (11.0, 13.0), (13.0, inf)]
# With n_bins=6, we get [(5.0, 10.0), (10.0, 12.0), (12.0, nan), (nan, inf), (inf, nan)]

Again, in cases where the difference between number of floats and number of np.infis bigger than 1, there are no nan in the output:

(8 floats, 6 np.inf)
# With n_bins=2, we get [(5.0, 16.0), (16.0, inf)]
# With n_bins=3, we get [(5.0, 13.0), (13.0, inf)]
# With n_bins=4, we get [(5.0, 12.0), (12.0, 16.0), (16.0, inf)]
# With n_bins=5, we get [(5.0, 11.0), (11.0, 14.0), (14.0, inf)]
# With n_bins=6, we get [(5.0, 10.0), (10.0, 13.0), (13.0, 16.0), (16.0, inf)]

Lastly, if i replace the np.inf with np.nan, everything works as it should.

Therefore, I am in favor of your second suggestion - if there are any np.inf, inform user to treat this and skip the variable in the process. I'd rather test input and warn user than doing some replacing behind which the users might not appreciate.

Copy link
Contributor

@sandervh14 sandervh14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, it seems we agree on the strategy to warn the user then. :-)
I checked the code, the tests run successfully and the dev branch merge conflicts are resolved. Let's go for it!

@sandervh14 sandervh14 merged commit 4e4aeea into develop Apr 16, 2021
@sandervh14 sandervh14 deleted the fix/discretizer_inf branch April 16, 2021 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discretizer gives error with NaNs

3 participants