Skip to content

Support NaN values in UpliftTree and UpliftRandomForest#860

Open
aman-coder03 wants to merge 6 commits intouber:masterfrom
aman-coder03:feature/handle-nan-uplift-trees
Open

Support NaN values in UpliftTree and UpliftRandomForest#860
aman-coder03 wants to merge 6 commits intouber:masterfrom
aman-coder03:feature/handle-nan-uplift-trees

Conversation

@aman-coder03
Copy link

This PR adds native support for missing values (NaNs) in UpliftTree and
UpliftRandomForest.

Each candidate split evaluates both possible NaN routing directions
(left/right) and learns the optimal routing per node, similar to
scikit-learn’s decision tree behavior.

The learned NaN routing is stored in each tree node and applied
consistently during training, pruning, filling, and prediction.

This resolves #802.

@CLAassistant
Copy link

CLAassistant commented Dec 29, 2025

CLA assistant check
All committers have signed the CLA.

@aman-coder03
Copy link
Author

The Read the Docs build failure appears to be due to Cython extensions requiring a compiler in the RTD environment.
No documentation-related files were modified in this PR.
Happy to adjust the RTD config if you’d like me to handle this

Copy link
Collaborator

@jeongyoonlee jeongyoonlee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution, @aman-coder03. I left a few comments. Can you address them? Also, please add test code to tests/test_uplift_trees.py accordingly. Thanks!

is_split_by_gt = False

for value in lsUnique:
len_X_l = group_counts_by_divide(columnValues, value, is_split_by_gt, treatment_idx, y, left_count_arr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this code block. Is there any reason you removed it?

n_reg
)

early_stopping_flag = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this code block for early stopping.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaN handling is needed here in the percentile calculation.

@aman-coder03
Copy link
Author

@jeongyoonlee i will look into this!

@aman-coder03 aman-coder03 force-pushed the feature/handle-nan-uplift-trees branch from e13b20b to bd23f63 Compare February 3, 2026 15:03
@aman-coder03
Copy link
Author

hey @jeongyoonlee i have implemented the changes, please have a look!

@jeongyoonlee
Copy link
Collaborator

@aman-coder03, I see some code blocks necessary (e.g., code for the CTS evaluation criteria) are removed without any comments. Also, some tests are failing.

@aman-coder03
Copy link
Author

Thanks for pointing that out. I will restore the original CTS and evaluation blocks and limit changes strictly to NaN routing logic to preserve original behavior @jeongyoonlee

@aman-coder03
Copy link
Author

@jeongyoonlee can you please have a look at the code changes now

@aman-coder03 aman-coder03 force-pushed the feature/handle-nan-uplift-trees branch from 97c291e to 4163bf5 Compare February 11, 2026 19:46
@jeongyoonlee
Copy link
Collaborator

Code review

Found 1 issue:

  1. np.isnan() called unconditionally on non-numeric columns in divideSet, causing a TypeError for any dataset with string/categorical features.

The new NaN routing code in divideSet applies np.isnan(X[:, column]) without checking whether the column is numeric:

filt = X[:, column] == value
# Handle NaNs
nan_mask = np.isnan(X[:, column])
if missing_go_to_left:

divideSet handles both numeric splits (isinstance(value, numbers.Number)) and string splits (the else branch). When X[:, column] is an object array of strings, np.isnan raises:

TypeError: ufunc 'isnan' not supported for the input types

The companion function divideSet_len in this same PR correctly guards the NaN logic with if np.issubdtype(X[:, column].dtype, np.number)::

# Handle NaNs only for numeric columns
if np.issubdtype(X[:, column].dtype, np.number):
nan_mask = np.isnan(X[:, column])
if missing_go_to_left:
filt = filt | nan_mask
else:

The same unguarded np.isnan pattern also appears in group_counts_by_divide:

# Handle NaNs
cdef np.ndarray[np.uint8_t, ndim=1, cast=True] nan_mask = np.isnan(col_vals)
if missing_go_to_left:
filt = filt | nan_mask

The fix is to wrap both np.isnan calls with the same numeric dtype guard used in divideSet_len. The test added in this PR only injects NaNs into a numeric feature column, so it would not catch this regression.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for Inputs with NaN in Uplift Trees/Uplift RandomForest

3 participants