fixes #39 by using inf #52
I made a second commit where I added
sandervh14
left a comment
Hi Honza,
I carefully checked the info in the issue and the pull request + the code.
I think that there is a remaining problem, related to the nan->inf trick you apply on the bin edges:
if math.isnan(bin_edges[0]):
    bin_edges[0] = -np.inf
if math.isnan(bin_edges[-1]):
    bin_edges[-1] = np.inf
This runs into trouble if the dataframe column already contains inf/-inf values.
Consider the following test case:
@pytest.mark.parametrize("n_bins, auto_adapt_bins, data, expected",
                         [
                             (2, False,
                              # Variable contains floats and inf (e.g. "available resources" variable): WORKS
                              pd.DataFrame({"variable": [5.4, 9.3, np.inf]}),
                              [(5.0, 10.0), (10.0, np.inf)])],
                         ids=["variable with floats and inf"])
def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
    discretizer = KBinsDiscretizer(n_bins=n_bins,
                                   auto_adapt_bins=auto_adapt_bins)
    actual = discretizer._fit_column(data, column_name="variable")
    assert actual == expected
bin_edges = [5.4, nan, nan] at first, as computed by:
if self.strategy == "quantile":
    bin_edges = list(data[column_name]
                     .quantile(np.linspace(0, 1, n_bins + 1),
                               interpolation='linear'))
Note that the number of nans output here depends on the chosen number of bins and on the spread of the float values present in the column. If both -inf and inf are present in the column values, it gets even more complicated.
Anyway, the nan->inf trick then converts the above bin_edges to bin_edges = [5.4, nan, inf], since it only converts the first and last bin edge to (-)inf if they are nan.
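To make this concrete, here is a minimal stdlib-only sketch of the same trick. The helper name patch_outer_edges is hypothetical (it is not part of cobra), and math.inf stands in for np.inf:

```python
import math

def patch_outer_edges(bin_edges):
    """Hypothetical helper: replace a nan first/last edge by -inf/inf,
    mirroring the nan->inf trick quoted above."""
    edges = list(bin_edges)  # work on a copy
    if math.isnan(edges[0]):
        edges[0] = -math.inf
    if math.isnan(edges[-1]):
        edges[-1] = math.inf
    return edges

nan = float("nan")
patched = patch_outer_edges([5.4, nan, nan])
print(patched)  # the middle edge is still nan
```

Only the outer edges are touched, so any nan in between survives and ends up in the bins.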
I see two possible solutions - WDYT?
- Keep at most one nan at each end of the bin_edges array, discarding all other nans, and then apply the nan->inf trick,
- or warn the user that the column contains -inf and/or inf, that the KBinsDiscretizer cannot treat this properly, and that the user should impute the infs in the column.
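A rough sketch of both options, using only the standard library. The function names collapse_edge_nans and warn_if_infinite and the warning text are hypothetical, not existing cobra code:

```python
import math
import warnings

def collapse_edge_nans(bin_edges):
    """Option 1 (sketch): drop every nan, then put back a single open
    edge at each end that originally was nan."""
    if not bin_edges:
        return []
    kept = [e for e in bin_edges if not math.isnan(e)]
    lower = [-math.inf] if math.isnan(bin_edges[0]) else []
    upper = [math.inf] if math.isnan(bin_edges[-1]) else []
    return lower + kept + upper

def warn_if_infinite(values, column_name):
    """Option 2 (sketch): warn if the column holds inf/-inf.
    Returns True when a warning was issued."""
    has_inf = any(math.isinf(v) for v in values)
    if has_inf:
        warnings.warn(
            f"Column '{column_name}' contains inf/-inf values; "
            "consider imputing them before discretizing.",
            UserWarning,
        )
    return has_inf

nan = float("nan")
print(collapse_edge_nans([5.4, nan, nan]))
warn_if_infinite([5.4, 9.3, math.inf], "variable")
```

With the example above, option 1 leaves only the edges 5.4 and inf, so fewer bins than requested remain. Option 2 leaves the edges alone and hands the decision back to the user.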
Note: I also found a minor bug while trying some stuff out for this pull request, but logged it as a separate issue, since it's only slightly related to the problem solved in this issue/PR.
Sander,
Your example with two bins:
I added one extra row:
After some tinkering, I realized that if the number of floats minus number of
What if we change the
Let's further play with
Again, in cases where the difference between number of floats and number of
Lastly, if I replace the
Therefore, I am in favor of your second suggestion - if there are any
# Conflicts: # cobra/preprocessing/kbins_discretizer.py
OK, it seems we agree on the strategy to warn the user then. :-)
I checked the code, the tests run successfully and the dev branch merge conflicts are resolved. Let's go for it!
fixes #39
- nan, which is not deterministic
- nan with np.inf/-np.inf (see line 404-419)
- list(dict.fromkeys(bin_edges))
- set() is unordered by definition, so resulting list will have different order.
- inf data
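As a small illustration of the deduplication point, a sketch with made-up bin edges (math.inf stands in for np.inf):

```python
import math

# Made-up bin edges containing a duplicate.
bin_edges = [-math.inf, 5.0, 5.0, 10.0, math.inf]

# dict keys keep insertion order (Python 3.7+), so duplicates are
# dropped while the remaining edges stay in their original order.
unique_edges = list(dict.fromkeys(bin_edges))
print(unique_edges)  # [-inf, 5.0, 10.0, inf]

# A set removes the same duplicates but gives no ordering guarantee,
# so list(set(bin_edges)) would need an explicit sort afterwards.
same_members = set(unique_edges) == set(bin_edges)
```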