[ADD] Calculate memory of dataset after one-hot encoding (PyTorch embedding) (#437)
Conversation
Force-pushed from c2a98c9 to f2f5f72
```python
port=X['logger_port'] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT,
```

Suggested change:

```suggestion
port=X.get('logger_port', logging.handlers.DEFAULT_TCP_LOGGING_PORT),
```
Actually, we don't need this code to be merged; I'll remove it.
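For reference, the reviewer's `dict.get` suggestion is behaviorally equivalent to the conditional lookup it replaces. A minimal standalone sketch (the `X` dictionary here is a hypothetical stand-in for the pipeline's fit dictionary):

```python
import logging.handlers

# Hypothetical fit dictionary; 'logger_port' may or may not be present.
X = {'dataset_properties': {}}

# Original form: explicit membership test with a conditional expression.
port_a = X['logger_port'] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT

# Suggested form: dict.get with a default, shorter and equivalent.
port_b = X.get('logger_port', logging.handlers.DEFAULT_TCP_LOGGING_PORT)

assert port_a == port_b
```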
```python
    else:
        multipliers.append(arr_dtypes[col].itemsize)
```
What happens in one-hot encoding when `num_cat` is larger than `MIN_CATEGORIES_FOR_EMBEDDING_MAX`?
They are not one-hot encoded but rather sent to the PyTorch embedding module, where the one-hot encoding happens implicitly.
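The per-column multiplier logic discussed in this thread can be sketched as follows. This is a simplified illustration, not the PR's actual code: the threshold value and the 8-byte `itemsize` are hypothetical, and the real implementation reads `arr_dtypes[col].itemsize` per column.

```python
# Hypothetical stand-in for the PR's MIN_CATEGORIES_FOR_EMBEDDING_MAX constant.
MIN_CATEGORIES_FOR_EMBEDDING_MAX = 7

def column_multipliers(n_categories_per_cat_column, itemsize=8):
    """Per-column byte multipliers for the memory estimate (sketch).

    Columns below the threshold are one-hot encoded, so each value
    expands to num_cat entries; columns at or above the threshold are
    passed to the PyTorch embedding module and keep their raw width.
    """
    multipliers = []
    for num_cat in n_categories_per_cat_column:
        if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
            multipliers.append(num_cat * itemsize)  # one-hot expansion
        else:
            multipliers.append(itemsize)            # embedded, no expansion
    return multipliers

# e.g. columns with 3, 5, and 100 categories:
print(column_multipliers([3, 5, 100]))  # [24, 40, 8]
```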
```python
        raise ValueError(err_msg)
    for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
        if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
            multipliers.append(num_cat * arr_dtypes[col].itemsize)
```
Is it already guaranteed that all columns are non-object?
Otherwise, we should check it.
Yes, it's guaranteed that all columns are not object dtype; moreover, they are also guaranteed to be NumPy arrays, as this code runs after the data has been transformed by the tabular feature validator.
```python
if len(categorical_columns) > 0:
    if n_categories_per_cat_column is None:
        raise ValueError(err_msg)
    for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
```
We could use `sum(...)` here, same as below. (optional)
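The reviewer's optional suggestion replaces the loop-and-append pattern with a single `sum(...)` over a generator expression. A sketch of the two equivalent forms, using a hypothetical fixed itemsize and threshold in place of the per-column values from the real code:

```python
ITEMSIZE = 8   # hypothetical: itemsize of a float64 column
THRESHOLD = 7  # hypothetical stand-in for MIN_CATEGORIES_FOR_EMBEDDING_MAX

n_categories = [3, 5, 100]

# Loop-and-append version, as in the diff:
multipliers = []
for num_cat in n_categories:
    if num_cat < THRESHOLD:
        multipliers.append(num_cat * ITEMSIZE)
    else:
        multipliers.append(ITEMSIZE)
total_loop = sum(multipliers)

# sum(...) one-liner the reviewer proposes, producing the same total:
total_sum = sum(
    (num_cat if num_cat < THRESHOLD else 1) * ITEMSIZE
    for num_cat in n_categories
)

assert total_loop == total_sum  # both 72 bytes per row in this example
```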
Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>
theodorju left a comment:
As discussed in the meeting, I reviewed the changes. Everything looks good to me, I'm just adding a minor suggestion as a comment.
autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py (outdated; conversation resolved)
autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py (outdated; conversation resolved)
…ssing/TabularColumnTransformer.py
…edding) (#437)

* add updates for apt1.0+reg_cocktails
* debug loggers for checking data and network memory usage
* add support for pandas, test for data passing, remove debug loggers
* remove unwanted changes
* :
* Adjust formula to account for embedding columns
* Apply suggestions from code review (Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>)
* remove unwanted additions
* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py (Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>)
This PR aims to improve the approximate memory-usage estimate of a dataset by considering the dataset after one-hot encoding has been applied. Based on our experiments (the reg cocktails ablation study), we observed that memory usage tends to explode when categorical columns with high cardinality are one-hot encoded. Moreover, even with the addition of PyTorch embeddings (which remove the need to one-hot encode every categorical column), excessive memory is used while building the neural network.
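A back-of-the-envelope illustration of the explosion described above (not autoPyTorch code; the row count and cardinality are made up for the example):

```python
# Why one-hot encoding a high-cardinality column explodes the memory estimate.
n_rows = 1_000_000
itemsize = 8  # bytes per float64 value

raw_mb = n_rows * 1 * itemsize / 2**20           # one raw categorical column
one_hot_mb = n_rows * 10_000 * itemsize / 2**20  # one-hot with 10,000 categories

# The one-hot form is num_cat times larger: ~7.6 MiB becomes ~74.5 GiB.
print(f"raw: {raw_mb:.1f} MiB, one-hot: {one_hot_mb:.1f} MiB")
```

Routing such high-cardinality columns to an embedding layer instead keeps the stored dataset at its raw width, which is exactly the case the `MIN_CATEGORIES_FOR_EMBEDDING_MAX` branch in this PR accounts for.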