Handling Input to auto pytorch #89
Conversation
| self.check_requirements(X, y) | ||
|
|
||
| self.preprocessor['categorical'] = OE(categories=X['dataset_properties']['categories']) | ||
| self.preprocessor['categorical'] = OE(handle_unknown='use_encoded_value', |
There was a problem hiding this comment.
maybe we dont need an ordinal encoder in the search space as the data has already gone through this phase. In case for a dataset, OneHotEncoding is not preferred, NoEncoder will have the same effect as doing Ordinal encoding.
There was a problem hiding this comment.
Oh no wait, we do not touch the numerical columns. The objective here is to make the data numerical.
So if you have a numerical columns with values like 0.0000000001 and so on, maybe changing that to 1, 2, 3, 4 and so one with an ordinal encoder improves the performance.
There was a problem hiding this comment.
but in its current form, its just working on the categorical columns. Also, that could be handled with the scaler component. Me and sebastian also discussed today about not having an encoding component. As when we merge the embedding PR which I'll finish soon, PyTorch has this nn.Embedding layer which performs OneHotEncoding and learns the embedding all by itself.
| A holdout set of labels | ||
| """ | ||
| if not self.is_classification or self.type_of_target == 'multilabel-indicator': | ||
| # Only fit an encoder for classification tasks |
There was a problem hiding this comment.
Should we also raise an error if in regression the data type of the target is not integer? Currently it seems like there is no check for that.
There was a problem hiding this comment.
I think #85 can handle this. Now I do not know what you mean exactly, as we have different regression targets. Usually we expected regression to have float type, which is something that the code handles right now.
There was a problem hiding this comment.
Oh I meant if somehow someone gives y values as a string. Which does not contain numbers but words, by mistake. I think in this validator module we should take care of this as well
ravinkohli
left a comment
There was a problem hiding this comment.
This PR looks really good. I have left a few minor comments to be addressed.
Enable any input type to autopytorch (
if input in (pd.DataFrame, List, sparse, numpy))Better error to user and plenty of more testing (included moving dataset to pytest)