fixes #39 by using inf #52

JanBenisek · 2021-03-17T16:23:34Z

fixes #39

check the extensive description in the Issue
the main cause is sorting of array with nan, which is not determinitstic
therefore, I am replacing the nan with np.inf/-np.inf (see line 404-419)
additionaly, we need to remove duplicates when returning the bin edges, but we cannot use set
- hence I used list(dict.fromkeys(bin_edges))
- because set() is unordered by definition, so resulting list will have different order.
I was also formatting the code based on pycodestyle
there are already unit tests for behaviour of this module, so I only added one example with inf data

JanBenisek · 2021-03-19T14:47:29Z

I made second commit where I added log.warning() if variables is 99% nan.

sandervh14

Hi Honza,

I carefully checked the info in the issue and the pull request + the code.
I think that there is a remaining problem, related to the nan->inf trick you apply on the bin edges:

        if math.isnan(bin_edges[0]):
            bin_edges[0] = -np.inf

        if math.isnan(bin_edges[-1]):
            bin_edges[-1] = np.inf

This would run into troubles if the dataframe column would initially already contain inf/-inf values.

Consider the following test case:

@pytest.mark.parametrize("n_bins, auto_adapt_bins, data, expected",
                             [
                              (2, False,
                               # Variable contains floats and inf (e.g. "available resources" variable): WORKS
                               pd.DataFrame({"variable": [5.4, 9.3, np.inf]}),
                               [(5.0, 10.0), (10.0, np.inf)])],
                             ids=["variable with floats and inf"])
    def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
        discretizer = KBinsDiscretizer(n_bins=n_bins,
                                       auto_adapt_bins=auto_adapt_bins)

        actual = discretizer._fit_column(data, column_name="variable")

        assert actual == expected

bin_edges = [5.4, nan, nan] at first, as computed by:

        if self.strategy == "quantile":
            bin_edges = list(data[column_name]
                             .quantile(np.linspace(0, 1, n_bins + 1),
                                       interpolation='linear'))

Note that the number of nans that would be outputted here depend on the chosen number of bins and on the spread of the float numbers that are present in the column. If both -inf and inf would be present in the column values, it even gets more complicated.

Anyway, then the nan->inf trick bin edge trick converts the above bin_edges to bin_edges = [5.4, nan, inf], since it only converts the first and last bin edge to (-)inf if they are nan.

I see two possible solutions - WDYT?

Keeping maximum 1 nan at both ends of the bin_edges array, so discarding all other nans, and then applying the nan->inf trick,
or warning the user about the presence of -inf and/or inf in the column and that the KBinsDiscretizer cannot properly treat this, that the user should impute infs in the column.

Note: I also found a minor bug while trying some stuff out for this pull request, but logged it as a separate issue, since it's only slightly related to the problem solved in this issue/PR.

JanBenisek · 2021-04-02T07:39:06Z

Sander,
I did some testing and found interesting conclusions.

Your example with two bins:

data = pd.DataFrame({"variable": [5.4, 9.3, np.inf]})
discretizer = KBinsDiscretizer(n_bins=2, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# This gives indeed [(5.0, nan), (nan, inf)]

I added one extra row:

data = pd.DataFrame({"variable": [5.4, 9.3, 10.2, np.inf]})
discretizer = KBinsDiscretizer(n_bins=2, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# Gives [(5.0, 10.0), (10.0, inf)]

After some tinkering, I realized that if the number of floats minus number of np.inf is equal to 1 (for example we have 7 floats and 6 infs), it returns nan in both edges. Not good. This assumes that n_bins=2.

What if we change the n_bins=3 parameter in your original example?

data = pd.DataFrame({"variable": [5.4, 9.3, np.inf]})
discretizer = KBinsDiscretizer(n_bins=3, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# This returns two bins (makes sense, we got only two values), but no nans: [(5.0, 8.0), (8.0, inf)]

Let's further play with n_bins. Now we got 7 floats and 6 np.inf.

data = pd.DataFrame({"variable": [5.4, 9.3, 10.2, 11.5, 12.2, 13.3, 14.4,
                                  np.inf, np.inf, np.inf, np.inf, np.inf, np.inf]})
discretizer = KBinsDiscretizer(n_bins=2, auto_adapt_bins=False)
actual = discretizer._fit_column(data, column_name="variable")

# With n_bins=2, we get [(5.0, nan), (nan, inf)]
# With n_bins=3, we get [(5.0, 12.0), (12.0, inf)]
# With n_bins=4, we get [(5.0, 12.0), (12.0, nan), (nan, nan), (nan, inf)]
# With n_bins=5, we get [(5.0, 11.0), (11.0, 13.0), (13.0, inf)]
# With n_bins=6, we get [(5.0, 10.0), (10.0, 12.0), (12.0, nan), (nan, inf), (inf, nan)]

Again, in cases where the difference between number of floats and number of np.infis bigger than 1, there are no nan in the output:

(8 floats, 6 np.inf)
# With n_bins=2, we get [(5.0, 16.0), (16.0, inf)]
# With n_bins=3, we get [(5.0, 13.0), (13.0, inf)]
# With n_bins=4, we get [(5.0, 12.0), (12.0, 16.0), (16.0, inf)]
# With n_bins=5, we get [(5.0, 11.0), (11.0, 14.0), (14.0, inf)]
# With n_bins=6, we get [(5.0, 10.0), (10.0, 13.0), (13.0, 16.0), (16.0, inf)]

Lastly, if i replace the np.inf with np.nan, everything works as it should.

Therefore, I am in favor of your second suggestion - if there are any np.inf, inform user to treat this and skip the variable in the process. I'd rather test input and warn user than doing some replacing behind which the users might not appreciate.

# Conflicts: # cobra/preprocessing/kbins_discretizer.py

sandervh14

OK, it seems we agree on the strategy to warn the user then. :-)
I checked the code, the tests run successfully and the dev branch merge conflicts are resolved. Let's go for it!

fixes #39 by using inf

b66a0c4

JanBenisek added this to the v1.0.2 milestone Mar 17, 2021

JanBenisek requested a review from sandervh14 March 17, 2021 16:23

JanBenisek linked an issue Mar 17, 2021 that may be closed by this pull request

Discretizer gives error with NaNs #39

Closed

Fixes #39 - added warning

6764bc6

sandervh14 suggested changes Mar 31, 2021

View reviewed changes

JanBenisek and others added 3 commits April 2, 2021 09:55

Warned and skipped inf var #39

c8582ac

Warned and skipped inf var #39

9eb1afd

Merge branch 'develop' into fix/discretizer_inf

4e4aeea

# Conflicts: # cobra/preprocessing/kbins_discretizer.py

sandervh14 approved these changes Apr 16, 2021

View reviewed changes

sandervh14 merged commit 4e4aeea into develop Apr 16, 2021

sandervh14 deleted the fix/discretizer_inf branch April 16, 2021 16:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fixes #39 by using inf #52

fixes #39 by using inf #52

Uh oh!

JanBenisek commented Mar 17, 2021

Uh oh!

JanBenisek commented Mar 19, 2021

Uh oh!

sandervh14 left a comment

Uh oh!

JanBenisek commented Apr 2, 2021

Uh oh!

sandervh14 left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fixes #39 by using inf #52

fixes #39 by using inf #52

Uh oh!

Conversation

JanBenisek commented Mar 17, 2021

Uh oh!

JanBenisek commented Mar 19, 2021

Uh oh!

sandervh14 left a comment

Choose a reason for hiding this comment

Uh oh!

JanBenisek commented Apr 2, 2021

Uh oh!

sandervh14 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

sandervh14 left a comment •

edited

Loading