Fast oblique by ClarkXu0625 · Pull Request #360 · neurodata/treeple

ClarkXu0625 · 2025-05-14T01:52:51Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Replace the Fisher-Yates shuffle with Floyd's method, a more efficient way to draw a uniform sample without replacement from a large range, in SPORF (i.e. ObliqueRandomForestClassifier). When the projection matrix is large (i.e. a large number of data features and/or a large max_features), this update significantly reduces training time without affecting predictions.
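For readers unfamiliar with the two approaches, here is a minimal Python sketch of the idea. It is an illustration only, not the Cython code in this PR: the function names `fisher_yates_sample` and `floyd_sample` are hypothetical, and the Fisher-Yates variant shown is a partial shuffle chosen for brevity. The point is that the shuffle-based sampler must build an index array of length n, while Floyd's method performs only k random draws and keeps a set of at most k values.

```python
import random


def fisher_yates_sample(n, k, rng):
    """Draw k unique integers from [0, n) via a partial Fisher-Yates shuffle.

    The full index array is built first, so time and memory scale with n.
    """
    indices = list(range(n))
    for i in range(k):
        j = rng.randrange(i, n)  # uniform in [i, n)
        indices[i], indices[j] = indices[j], indices[i]
    return indices[:k]


def floyd_sample(n, k, rng):
    """Draw k unique integers from [0, n) with Floyd's method.

    Only k random draws and a set of at most k entries are needed.
    """
    seen = set()
    out = []
    for i in range(n - k, n):
        r = rng.randrange(0, i + 1)  # uniform in [0, i]
        # If r was already chosen, take i instead; i cannot be in `seen`
        # because every earlier pick is at most i - 1.
        pick = r if r not in seen else i
        seen.add(pick)
        out.append(pick)
    return out


rng = random.Random(0)
sample_fy = fisher_yates_sample(1_000_000, 5, rng)
sample_floyd = floyd_sample(1_000_000, 5, rng)
```

Both calls return 5 distinct integers below 1,000,000, but only the first allocates a million-element list, which is the cost this PR aims to avoid for large projection matrices.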

Any other comments?


adam2392

Cool! What's the general speedup you're observing w/ this alternative sampling method?

adam2392 · 2025-06-01T15:11:01Z

            indices_to_sample[i], indices_to_sample[j]


+cdef void floyd_sample_indices(


Can this be inlined?

Thanks, Yuxin!

After inlining the Floyd sampling method, the overhead in SPORF has been eliminated. Now, Floyd’s method is consistently faster than the original Fisher-Yates approach.
(Sorry for the inconsistency in training time below — the before-and-after tests were run on different physical machines.)

Below are comparisons between the original treeple (using the Fisher-Yates shuffle, right) and the new implementation (left).

Before changing to inline:

After changing floyd to inline:

Yep! Without inlining there is overhead, since the Fisher-Yates function was also inlined. A few questions:

Can you put those plots on the same scale?

Also how many reps did you run per cell? If you ran 5-10 then this is nice.

The upper two heatmaps were run with 5 reps, and the lower two (not inlined) with 3 reps.

Below are the heatmaps on the same scale, with 5 reps:

adam2392 · 2025-06-02T21:21:53Z

        # sample 'n_non_zeros' in a mtry X n_features projection matrix
        # which consists of +/- 1's chosen at a 1/2s rate


Can you explain why the below for loop is still needed then? Perhaps what would help is some additional explanation of what the expected input/output is in the floyd_sample_indices.

The inputs are:
out - preallocated buffer that holds the sampled indices (size ≥ k);
k - number of samples to draw;
n - population size;
random_state - the random state.

The output is out, filled in place with k unique random integers selected uniformly without replacement from the interval [0, n).

In _oblique_splitter.pyx, it samples n_non_zeros unique integers from the range [0, grid_size) and stores them in indices_to_sample[:]. The for loop is still needed because each 1D index must be mapped back to 2D coordinates in the projection matrix and assigned a weight.
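To make the input/output contract and the follow-up loop concrete, here is a rough Python sketch. It is not the Cython code from _oblique_splitter.pyx: `sample_projection_entries` is a hypothetical helper, the row-major layout (row = index // n_features, column = index % n_features) is an assumption, and the ±1 weights drawn with probability 1/2 follow the code comment quoted in the review.

```python
import random


def floyd_sample_indices(out, k, n, rng):
    """Fill out[:k] in place with k unique integers drawn from [0, n)."""
    seen = set()
    for count, i in enumerate(range(n - k, n)):
        r = rng.randrange(0, i + 1)  # uniform in [0, i]
        pick = r if r not in seen else i
        seen.add(pick)
        out[count] = pick


def sample_projection_entries(max_features, n_features, n_non_zeros, rng):
    """Return (row, col, weight) triples for the nonzero matrix entries.

    Hypothetical helper: assumes the flat grid is laid out row by row.
    """
    grid_size = max_features * n_features
    indices_to_sample = [0] * n_non_zeros  # preallocated output buffer
    floyd_sample_indices(indices_to_sample, n_non_zeros, grid_size, rng)

    entries = []
    for idx in indices_to_sample:
        row = idx // n_features  # which projection (matrix row)
        col = idx % n_features   # which feature (matrix column)
        weight = 1.0 if rng.random() < 0.5 else -1.0
        entries.append((row, col, weight))
    return entries


entries = sample_projection_entries(4, 10, 6, random.Random(42))
```

The sampler only returns flat positions in the grid; the loop is where each position becomes a (projection, feature) pair with a sign, which is why it cannot be dropped.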

ClarkXu0625 · 2025-06-03T19:16:02Z

Cool! What's the general speedup you're observing w/ this alternative sampling method?

The speedup depends on the projection matrix size (max_features and the number of features in the dataset). When the number of features increases while feature_combinations stays the same, training time with the new sampling method remains roughly constant. In contrast, training time with the original sampling method grows rapidly as the number of features increases with all other conditions held constant.

The speedup can reach 10x when the projection matrix is as large as 4096×4096. However, for smaller projection matrices (e.g., 64×64), the new method may incur a slight overhead, resulting in a slowdown of less than 10%.

YuxinB · 2025-06-06T04:06:03Z

Tested the PR by generating the oblique demo with Floyd's sampling method; feature_combination = 1.5 here.

PSSF23 · 2025-06-10T18:14:47Z

+    cdef intp_t i, r, count = 0
+
+    for i in range(n - k, n):
+        r = rand_int(0, i + 1, random_state)
+        if seen.find(r) == seen.end():
+            seen.insert(r)
+            out[count] = r
+        else:
+            seen.insert(i)
+            out[count] = i
+        count += 1


    cdef intp_t i, r = 0
    for i in range(n - k, n):
        r = rand_int(0, i + 1, random_state)
        if seen.find(r) == seen.end():
            seen.insert(r)
            out[i - n + k] = r
        else:
            seen.insert(i)
            out[i - n + k] = i

A little simplification suggestion. @ClarkXu0625
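As a quick sanity check on the suggestion, the Python sketch below compares the two indexing schemes. The function names are hypothetical and Python's `random.Random` stands in for `rand_int`. Since i runs from n - k to n - 1, the offset i - n + k takes the values 0 through k - 1 in order, so it lands in the same slot as the explicit counter.

```python
import random


def floyd_with_counter(n, k, rng):
    """Original form: a separate `count` variable tracks the output slot."""
    out = [0] * k
    seen = set()
    count = 0
    for i in range(n - k, n):
        r = rng.randrange(0, i + 1)
        pick = r if r not in seen else i
        seen.add(pick)
        out[count] = pick
        count += 1
    return out


def floyd_with_offset(n, k, rng):
    """Simplified form: the output slot is computed as i - n + k."""
    out = [0] * k
    seen = set()
    for i in range(n - k, n):
        r = rng.randrange(0, i + 1)
        pick = r if r not in seen else i
        seen.add(pick)
        out[i - n + k] = pick
    return out


# With identical seeds both variants consume the same random draws.
same = (
    floyd_with_counter(100, 10, random.Random(7))
    == floyd_with_offset(100, 10, random.Random(7))
)
```

Given the same seed, the two functions fill the buffer identically, so the change only removes a variable and an increment.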

Thank you, Hao! I just tested the running time for this cleaner version
(bottom right is the training time from this improved version).

PSSF23

@ClarkXu0625 Awesome. Just clean up the styles according to the cython-lint errors. And I think we are good to go? @adam2392 @YuxinB

YuxinB · 2025-06-12T03:38:03Z

Kay, will merge! Thank you all!

ClarkXu0625 added 30 commits April 27, 2025 21:10

Create test_profile.ipynb

77b4744

Copy important result notebook

5a5b025

Create README.md

23f2918

A list of command code to reset

Update test_profile.ipynb

8737180

Update test_profile.ipynb

a9d5a5f

Update constant_non0_projection_exp_4_21.ipynb

e551a7c

Update README.md

9cf6e10

update profile result

e3c2481

Update README.md

80db642

Update README.md

d3056c4

Update README.md

e0d2dd2

Update README.md

c4a1d60

save previous experiments

ef49fd8

to profile oblique splitter

fbc4a86

check local implementation

f996350

Re-organize

70f9a1e

constant nonzeros per row

160a4e2

save constant per row results

1d1ed0f

Update constant_non0_projection_exp_4_21.ipynb

272e3b7

Update create_dateset.ipynb

05568df

Update test_profile.ipynb

c9c576f

Update test_profile.ipynb

beb3803

rename

928295c

Update README.md

3684265

results with larger projection matrices

d829d73

update running time on linux

e4018f6

remove repeating cells

4ea6000

fix wrong plots

401065c

add comparison between training time at different non-zeros in projec…

d09ca58

…tion matrix

Update README.md

8252f75

Clark Xu and others added 14 commits May 15, 2025 13:04

remove fisher yates

da4083f

delete repeat function

b42b75e

remove clark experiment folder

1c86a58

remove sklearn

d17be19

update to match original version

2e37081

delete printline

aad9d9c

delete profiling

bc94055

delete comment

4930e3c

keep fisher yates for potential use

daf443d

remove comments

26ccbb5

remove unused import

1705c7d

remove repeating functions

d94cd3c

Update _utils.pyx

4df1183

remove import that does not match

a7485c8

YuxinB requested review from PSSF23 and adam2392 and removed request for adam2392 May 31, 2025 21:15

adam2392 reviewed Jun 2, 2025

ClarkXu0625 and others added 3 commits June 4, 2025 03:16

Update treeple/tree/_utils.pyx

d60d350

Co-authored-by: Adam Li <adam2392@gmail.com>

Update treeple/tree/_utils.pyx

69b819a

Co-authored-by: Adam Li <adam2392@gmail.com>

Update _utils.pyx

a48afb7

make floyd_sample_indices() inlined

PSSF23 reviewed Jun 10, 2025

cleaner floyd

7e55b2c

PSSF23 reviewed Jun 11, 2025

style fix

069c9bc

YuxinB merged commit 752d589 into neurodata:fast-oblique Jun 14, 2025
21 of 32 checks passed

4 participants