[ADD] Calculate memory of dataset after one-hot encoding (PyTorch embedding) (#437)
Conversation
Force-pushed from c2a98c9 to f2f5f72
```python
port=X['logger_port'] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT,
```

Suggested change:

```suggestion
port=X.get('logger_port', logging.handlers.DEFAULT_TCP_LOGGING_PORT),
```
Actually, we don't need this code to be merged; I'll remove it.
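For reference, the reviewer's `dict.get` suggestion is behaviorally equivalent to the conditional lookup it replaces. A minimal standalone sketch (the `X` dictionary here is a hypothetical stand-in for the pipeline's fit dictionary):

```python
import logging.handlers

# Hypothetical fit dictionary; 'logger_port' may or may not be present.
X = {'dataset_properties': {}}

# Original form: explicit membership test with a conditional expression.
port_a = X['logger_port'] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT

# Suggested form: dict.get with a default, shorter and equivalent.
port_b = X.get('logger_port', logging.handlers.DEFAULT_TCP_LOGGING_PORT)

assert port_a == port_b
```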
```python
    else:
        multipliers.append(arr_dtypes[col].itemsize)
```
What happens in one-hot encoding when `num_cat` is larger than `MIN_CATEGORIES_FOR_EMBEDDING_MAX`?
They are not one-hot encoded but rather sent to the PyTorch embedding module, where the one-hot encoding happens implicitly.
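The per-column multiplier logic discussed in this thread can be sketched as follows. This is a simplified illustration, not the PR's actual code: the threshold value and the 8-byte `itemsize` are hypothetical, and the real implementation reads `arr_dtypes[col].itemsize` per column.

```python
# Hypothetical stand-in for the PR's MIN_CATEGORIES_FOR_EMBEDDING_MAX constant.
MIN_CATEGORIES_FOR_EMBEDDING_MAX = 7

def column_multipliers(n_categories_per_cat_column, itemsize=8):
    """Per-column byte multipliers for the memory estimate (sketch).

    Columns below the threshold are one-hot encoded, so each value
    expands to num_cat entries; columns at or above the threshold are
    passed to the PyTorch embedding module and keep their raw width.
    """
    multipliers = []
    for num_cat in n_categories_per_cat_column:
        if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
            multipliers.append(num_cat * itemsize)  # one-hot expansion
        else:
            multipliers.append(itemsize)            # embedded, no expansion
    return multipliers

# e.g. columns with 3, 5, and 100 categories:
print(column_multipliers([3, 5, 100]))  # [24, 40, 8]
```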
```python
        raise ValueError(err_msg)
    for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
        if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
            multipliers.append(num_cat * arr_dtypes[col].itemsize)
```
Is it already guaranteed that all columns are non-object?
Otherwise, we should check it.
Yes, it's guaranteed that all columns are not object dtype; moreover, they are also guaranteed to be NumPy arrays, as this code runs after the data has been transformed by the tabular feature validator.
```python
if len(categorical_columns) > 0:
    if n_categories_per_cat_column is None:
        raise ValueError(err_msg)
    for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
```
We could use `sum(...)` here, same as below. (optional)
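The reviewer's optional suggestion replaces the loop-and-append pattern with a single `sum(...)` over a generator expression. A sketch of the two equivalent forms, using a hypothetical fixed itemsize and threshold in place of the per-column values from the real code:

```python
ITEMSIZE = 8   # hypothetical: itemsize of a float64 column
THRESHOLD = 7  # hypothetical stand-in for MIN_CATEGORIES_FOR_EMBEDDING_MAX

n_categories = [3, 5, 100]

# Loop-and-append version, as in the diff:
multipliers = []
for num_cat in n_categories:
    if num_cat < THRESHOLD:
        multipliers.append(num_cat * ITEMSIZE)
    else:
        multipliers.append(ITEMSIZE)
total_loop = sum(multipliers)

# sum(...) one-liner the reviewer proposes, producing the same total:
total_sum = sum(
    (num_cat if num_cat < THRESHOLD else 1) * ITEMSIZE
    for num_cat in n_categories
)

assert total_loop == total_sum  # both 72 bytes per row in this example
```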
Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>
theodorju left a comment:
As discussed in the meeting, I reviewed the changes. Everything looks good to me, I'm just adding a minor suggestion as a comment.
autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py (outdated; conversation resolved)
autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py (outdated; conversation resolved)
…ssing/TabularColumnTransformer.py
…edding) (#437)

* add updates for apt1.0+reg_cocktails
* debug loggers for checking data and network memory usage
* add support for pandas, test for data passing, remove debug loggers
* remove unwanted changes
* :
* Adjust formula to account for embedding columns
* Apply suggestions from code review (Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>)
* remove unwanted additions
* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py (Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>)
This PR aims to improve the approximate memory-usage estimate of a dataset by considering the dataset after one-hot encoding has been applied. Based on our experiments (the reg cocktails ablation study), we observed that memory usage tends to explode when categorical columns with high cardinality are one-hot encoded. Moreover, even with the addition of PyTorch embeddings (which remove the need to one-hot encode every categorical column), excessive memory is used while building the neural network.
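A back-of-the-envelope illustration of the explosion described above (not autoPyTorch code; the row count and cardinality are made up for the example):

```python
# Why one-hot encoding a high-cardinality column explodes the memory estimate.
n_rows = 1_000_000
itemsize = 8  # bytes per float64 value

raw_mb = n_rows * 1 * itemsize / 2**20           # one raw categorical column
one_hot_mb = n_rows * 10_000 * itemsize / 2**20  # one-hot with 10,000 categories

# The one-hot form is num_cat times larger: ~7.6 MiB becomes ~74.5 GiB.
print(f"raw: {raw_mb:.1f} MiB, one-hot: {one_hot_mb:.1f} MiB")
```

Routing such high-cardinality columns to an embedding layer instead keeps the stored dataset at its raw width, which is exactly the case the `MIN_CATEGORIES_FOR_EMBEDDING_MAX` branch in this PR accounts for.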