
bench: Migrate Python cuda.compute benchmarks to nvbench#7341

Merged
gevtushenko merged 20 commits into main from cuda-compute-nvbench
Mar 24, 2026

Conversation

@danielfrg
Contributor

Description

closes #7317

Migrating the Python cuda.compute benchmarks from pytest to nvbench.

  • Adds equivalent C++ benchmarks to compare Python vs C++

@danielfrg danielfrg requested review from a team as code owners January 23, 2026 21:54
@danielfrg danielfrg requested a review from gonidelis January 23, 2026 21:54
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 23, 2026
@copy-pr-bot
Contributor

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Jan 23, 2026
@danielfrg
Contributor Author

@shwina I migrated one benchmark to nvbench here and added a C++ equivalent. If you can review it and confirm this is what you had in mind, I can continue with the rest.

I added some instructions and wired up the C++ part with CMake, but let me know if you have other ideas on how to make that part better or easier.

On my dev GPU (4090) I got:

# bench_unary_transform_pointer

GPU Time: Mean GPU execution time (cold start, pure kernel)
  CUDA events (nvbench tag: nv/cold/time/gpu/mean)
CPU Time: Mean CPU (host) latency
  Host clock (nvbench tag: nv/cold/time/cpu/mean)

## [0] NVIDIA GeForce RTX 4090

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^1        | 3.097 us   | 3.392 us   | 9.51%      | 15.209 us  | 22.101 us  | 6.892 us   |
| 2^10       | 3.126 us   | 4.525 us   | 44.72%     | 15.202 us  | 23.571 us  | 8.369 us   |
| 2^14       | 3.307 us   | 3.607 us   | 9.07%      | 15.399 us  | 22.187 us  | 6.788 us   |
| 2^17       | 4.902 us   | 6.229 us   | 27.08%     | 16.972 us  | 25.270 us  | 8.298 us   |
| 2^20       | 12.139 us  | 12.571 us  | 3.56%      | 24.200 us  | 30.918 us  | 6.718 us   |
| 2^24       | 149.988 us | 150.786 us | 0.53%      | 162.432 us | 169.373 us | 6.941 us   |
| 2^26       | 586.332 us | 584.699 us | -0.28%     | 598.789 us | 603.301 us | 4.511 us   |
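For reference, the derived columns in these tables can be computed from the per-run means like so (the helper names are mine, not necessarily what the summary script uses):

```python
# Sketch of how the "% Slower" and "CPU Ovhd" columns are derived from the
# mean GPU/CPU times reported by nvbench. Helper names are hypothetical.

def pct_slower(cpp_us: float, py_us: float) -> float:
    """Relative Python slowdown vs C++, in percent (negative = Python faster)."""
    return (py_us - cpp_us) / cpp_us * 100.0

def cpu_overhead(cpp_cpu_us: float, py_cpu_us: float) -> float:
    """Extra host-side latency of the Python dispatch path, in microseconds."""
    return py_cpu_us - cpp_cpu_us

# The 2^20 row from the table above:
print(f"{pct_slower(12.139, 12.571):.2f}%")      # 3.56%
print(f"{cpu_overhead(24.200, 30.918):.3f} us")  # 6.718 us
```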

@shwina
Contributor

shwina commented Jan 25, 2026

Thanks @danielfrg - I think this is a really great start!

With regards to comparing C++ and Python benchmarks, I wonder if we can't use the existing nvbench_compare script - https://github.com/NVIDIA/nvbench/blob/main/scripts/nvbench_compare.py? IIRC @NaderAlAwar has had some previous experience doing this, so it would be great to have his feedback as well.
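The summary tags quoted earlier in this thread (`nv/cold/time/gpu/mean`, `nv/cold/time/cpu/mean`) live inside nvbench's JSON output. A rough sketch of extracting one, using a deliberately simplified inline document (the real files nest state and summary data more deeply, so inspect an actual `--json` result before relying on this shape):

```python
import json

# Simplified stand-in for an nvbench JSON result file; the real layout is
# richer, but the tag-based lookup idea is the same.
SIMPLIFIED_RESULT = json.loads("""
{
  "benchmarks": [
    {
      "name": "bench_unary_transform_pointer",
      "states": [
        {
          "name": "Elements=2^20",
          "summaries": [
            {"tag": "nv/cold/time/gpu/mean", "value": 12.571e-06},
            {"tag": "nv/cold/time/cpu/mean", "value": 30.918e-06}
          ]
        }
      ]
    }
  ]
}
""")

def mean_gpu_seconds(result: dict, bench_name: str, state_name: str) -> float:
    """Look up a state's mean cold GPU time (seconds) by its summary tag."""
    for bench in result["benchmarks"]:
        if bench["name"] != bench_name:
            continue
        for state in bench["states"]:
            if state["name"] != state_name:
                continue
            for summary in state["summaries"]:
                if summary["tag"] == "nv/cold/time/gpu/mean":
                    return summary["value"]
    raise KeyError(f"{bench_name}/{state_name} not found")

t = mean_gpu_seconds(SIMPLIFIED_RESULT, "bench_unary_transform_pointer", "Elements=2^20")
print(f"{t * 1e6:.3f} us")  # 12.571 us
```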

Contributor

@NaderAlAwar NaderAlAwar left a comment


Thanks for working on this @danielfrg! The transform benchmark looks great. Two points:

  1. As @shwina pointed out, the comparison script (and most of the other analysis scripts) already exists in nvbench, so we should not copy them. I am working on releasing a pynvbench wheel and will figure out a way to expose that functionality properly. Once the wheels are released, we should make pynvbench an optional dependency that you can add when you pip install cuda-cccl.
  2. We should not add the C++ version of the benchmark. The CUB benchmarks already exist under cub/benchmarks/bench/, so in my opinion we should try to mimic at least one benchmark from each CUB algorithm so we can compare the two properly.
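One way to wire up the optional dependency mentioned in point 1 is a pyproject extra. The extra name and wheel name below are hypothetical placeholders, not the final packaging:

```toml
# Hypothetical cuda-cccl pyproject.toml fragment; the extra name ("bench")
# and the dependency name ("pynvbench") are placeholders until the wheels
# actually ship.
[project.optional-dependencies]
bench = ["pynvbench"]
```

Users would then opt in with something like `pip install cuda-cccl[bench]`.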

I'm curious about python_vs_cpp_summary.py. Can you show me what the output would look like? It doesn't have to be from a real run, just a visualization.

@danielfrg
Contributor Author

  1. Agreed. The nvbench scripts are here for convenience at the moment; if you are able to expose them in the wheels, we can remove them. They work great, but I think the Python-vs-C++ one produces somewhat more convenient reports for this case in general; the summary table in my first comment is from that script.
  2. Makes sense on reusing the existing benchmarks; for this example I was just converting the existing Python pytest one. Do we want to convert the ones we have instead? Let's say starting with this one: https://github.com/NVIDIA/cccl/blob/main/cub/benchmarks/bench/transform/fill.cu ?

@NaderAlAwar
Contributor

I see now; yeah, the python vs cpp summary script should stay. I think it would be more valuable to convert the existing C++ ones, and yes, fill.cu seems like a good start.

@NaderAlAwar
Contributor

Also, take a look at https://github.com/NVIDIA/cccl/blob/main/CONTRIBUTING.md#code-formatting-pre-commit-hooks to fix the pre-commit CI issue.

@danielfrg
Contributor Author

I updated the one benchmark here to be based on fill.cu, and overall the results look good to me: for large data sizes we are within the 5% target:

python analysis/python_vs_cpp_summary.py results/fill_py.json results/fill_cpp.json
# bench_fill

GPU Time: Mean GPU execution time (cold start, pure kernel)
  CUDA events (nvbench tag: nv/cold/time/gpu/mean)
CPU Time: Mean CPU (host) latency
  Host clock (nvbench tag: nv/cold/time/cpu/mean)

## [0] NVIDIA GeForce RTX 4090

### Type: I16

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 2.754 us   | 4.278 us   | 55.35%     | 14.796 us  | 23.269 us  | 8.474 us   |
| 2^20       | 3.532 us   | 4.972 us   | 40.76%     | 15.715 us  | 23.905 us  | 8.190 us   |
| 2^24       | 34.311 us  | 40.369 us  | 17.65%     | 46.510 us  | 58.837 us  | 12.327 us  |
| 2^28       | 558.510 us | 565.441 us | 1.24%      | 570.950 us | 584.637 us | 13.687 us  |
| 2^32       | 8.952 ms   | 8.958 ms   | 0.06%      | 8.964 ms   | 8.978 ms   | 13.978 us  |

### Type: I32

| Elements   | C++ GPU   | Py GPU    | % Slower   | C++ CPU   | Py CPU    | CPU Ovhd   |
|------------|-----------|-----------|------------|-----------|-----------|------------|
| 2^16       | 2.861 us  | 3.365 us  | 17.64%     | 14.940 us | 21.710 us | 6.770 us   |
| 2^20       | 4.511 us  | 4.837 us  | 7.24%      | 16.391 us | 23.517 us | 7.127 us   |
| 2^24       | 69.059 us | 67.794 us | -1.83%     | 81.073 us | 85.855 us | 4.782 us   |
| 2^28       | 1.118 ms  | 1.120 ms  | 0.21%      | 1.130 ms  | 1.139 ms  | 8.893 us   |

### Type: I64

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 2.881 us   | 4.293 us   | 48.99%     | 14.974 us  | 23.148 us  | 8.174 us   |
| 2^20       | 8.536 us   | 9.735 us   | 14.04%     | 20.425 us  | 28.462 us  | 8.037 us   |
| 2^24       | 140.137 us | 140.560 us | 0.30%      | 152.051 us | 159.512 us | 7.460 us   |
| 2^28       | 2.239 ms   | 2.239 ms   | -0.00%     | 2.252 ms   | 2.259 ms   | 7.814 us   |

### Type: I8

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 2.727 us   | 3.350 us   | 22.85%     | 14.533 us  | 21.981 us  | 7.448 us   |
| 2^20       | 3.160 us   | 3.842 us   | 21.59%     | 15.063 us  | 22.308 us  | 7.245 us   |
| 2^24       | 17.222 us  | 22.981 us  | 33.45%     | 29.363 us  | 41.180 us  | 11.817 us  |
| 2^28       | 278.926 us | 285.592 us | 2.39%      | 291.525 us | 304.250 us | 12.725 us  |
| 2^32       | 4.476 ms   | 4.482 ms   | 0.14%      | 4.488 ms   | 4.502 ms   | 13.767 us  |

I had Claude analyze the APIs and give me a mapping for the transform benchmarks:

| CUB Benchmark | CUB API | Python API | Status | Notes |
|---|---|---|---|---|
| transform/fill.cu | cub::detail::transform::dispatch with return_constant | unary_transform + ConstantIterator | DONE | Using ConstantIterator(42) as a 0-ary generator |
| transform/babelstream.cu | cub::detail::transform::dispatch (mul, add, triad, nstream) | unary_transform, binary_transform | Ready | Multiple operations: a\*scalar, a+b, a+b\*scalar, a+b+c\*scalar |
| transform/heavy.cu | cub::detail::transform::dispatch with heavy compute | unary_transform | Ready | Custom function with configurable compute intensity |
| transform/fib.cu | cub::detail::transform::dispatch | unary_transform | Ready | Fibonacci computation |
| transform/grayscale.cu | cub::detail::transform::dispatch with struct input | unary_transform + @gpu_struct | Ready | RGB struct → grayscale; needs gpu_struct for the RGB type |
| transform/complex_cmp.cu | cub::detail::transform::dispatch | unary_transform | Ready | Complex comparison logic |
| transform_reduce/sum.cu | cub::detail::reduce::dispatch with transform | reduce_into with TransformIterator | ⚠️ Workaround | Needs TransformIterator or a pre-transform pass |

They look good to me at a quick glance. Let me know what you all prefer in terms of PRs: do we want to keep them all in a single one? If the structure looks good now, I will start on more transform benchmarks.

@danielfrg
Contributor Author

Changed the C++ build script to use the one in the ci folder and added a single script to run the benchmarks.

I added two extra benchmarks to show how that works. The babelstream one looks OK on my GPU, but the transform_heavy one shows big performance differences:

# heavy

GPU Time: Mean GPU execution time (cold start, pure kernel)
  CUDA events (nvbench tag: nv/cold/time/gpu/mean)
CPU Time: Mean CPU (host) latency
  Host clock (nvbench tag: nv/cold/time/cpu/mean)

## [0] NVIDIA GeForce RTX 4090

### Heaviness=128

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 6.171 us   | 12.469 us  | 102.07%    | 18.244 us  | 28.769 us  | 10.525 us  |
| 2^20       | 48.865 us  | 145.804 us | 198.38%    | 60.683 us  | 162.066 us | 101.383 us |
| 2^24       | 613.619 us | 2.187 ms   | 256.37%    | 626.126 us | 2.204 ms   | 1.578 ms   |
| 2^28       | 9.894 ms   | 34.658 ms  | 250.30%    | 9.906 ms   | 34.677 ms  | 24.770 ms  |

### Heaviness=256

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 9.700 us   | 384.247 us | 3861.21%   | 21.617 us  | 400.758 us | 379.141 us |
| 2^20       | 105.618 us | 9.129 ms   | 8543.15%   | 117.345 us | 9.147 ms   | 9.029 ms   |
| 2^24       | 1.541 ms   | 149.250 ms | 9586.40%   | 1.553 ms   | 149.271 ms | 147.718 ms |
| 2^28       | 24.399 ms  | 2.416 s    | 9800.04%   | 24.412 ms  | 2.416 s    | 2.391 s    |

### Heaviness=32

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 4.316 us   | 5.651 us   | 30.94%     | 16.286 us  | 22.287 us  | 6.001 us   |
| 2^20       | 15.354 us  | 36.968 us  | 140.77%    | 27.240 us  | 53.072 us  | 25.832 us  |
| 2^24       | 174.735 us | 495.926 us | 183.82%    | 186.898 us | 512.593 us | 325.695 us |
| 2^28       | 2.889 ms   | 7.961 ms   | 175.57%    | 2.902 ms   | 7.979 ms   | 5.077 ms   |

### Heaviness=64

| Elements   | C++ GPU    | Py GPU    | % Slower   | C++ CPU    | Py CPU    | CPU Ovhd   |
|------------|------------|-----------|------------|------------|-----------|------------|
| 2^16       | 4.989 us   | 7.757 us  | 55.49%     | 16.932 us  | 24.156 us | 7.224 us   |
| 2^20       | 25.499 us  | 76.779 us | 201.10%    | 37.433 us  | 92.815 us | 55.382 us  |
| 2^24       | 317.748 us | 1.005 ms  | 216.32%    | 329.875 us | 1.022 ms  | 692.030 us |
| 2^28       | 5.204 ms   | 15.931 ms | 206.15%    | 5.216 ms   | 15.949 ms | 10.733 ms  |


Comparison saved to: /home/danielfrg/Documents/cccl/python/cuda_cccl/benchmarks/compute/nvbench/results/heavy_comparison.txt

Any ideas on why this might be happening? My guess would be the numba kernels, or the compilation of those.
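One observation from the tables above: the Python/C++ GPU-time ratio grows with Heaviness rather than staying a fixed overhead, which would be consistent with the generated kernel being less efficient per iteration. A quick sketch using the 2^24 rows posted above:

```python
# Python/C++ mean GPU time ratios at 2^24 elements, taken from the tables
# above, to show the gap scales with Heaviness instead of being a constant
# dispatch cost.
gpu_mean_us = {  # heaviness -> (C++ GPU us, Py GPU us)
    32: (174.735, 495.926),
    64: (317.748, 1005.0),
    128: (613.619, 2187.0),
    256: (1541.0, 149250.0),
}

for heaviness, (cpp, py) in sorted(gpu_mean_us.items()):
    print(f"Heaviness={heaviness:>3}: Python is {py / cpp:.1f}x slower")
```

On these numbers the ratios come out to roughly 2.8x, 3.2x, 3.6x, and then a jump to 96.9x at Heaviness=256, which is why the 256 case looks like a codegen cliff rather than gradual degradation.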

Contributor

@NaderAlAwar NaderAlAwar left a comment


This looks great! Thanks Daniel!

Comment on lines +21 to +23
- cuda-bench[cu12]>=0.2.0
- cuda-cccl[cu12]
- cupy-cuda12x
Contributor


Question: is there a way to not hardcode this to cu12?

Contributor Author


Not 100% sure, but I don't think so; my plan was to have another environment.yml for CUDA 13.

@NaderAlAwar
Contributor

pre-commit.ci autofix

@NaderAlAwar
Contributor

> Any ideas on why this might be happening? My guess would be mostly on the numba kernels or compilation of those?

These results are interesting; we should definitely look into them. I would assume it is something in numba, but I would have to look at the SASS to say for sure. In any case, this should not block the PR. Can you create an issue noting this? Let's maybe have an epic issue comparing cuda.compute performance to CUB, and make this specific instance a sub-issue.

@danielfrg
Contributor Author

Agreed, it should not block this one. I will create an epic issue to analyze the results.
I'll continue adding the other transform benchmarks in this PR unless you all prefer something else.

@NaderAlAwar
Contributor

It's up to you. We can merge this now, or you can add the other benchmarks first.

@danielfrg
Contributor Author

Ok, then I will add a couple more here!

@danielfrg
Contributor Author

I think I addressed all the previous comments.

Contributor

@NaderAlAwar NaderAlAwar left a comment


The benchmarks look great now; I left a few comments to clean things up. It would be good to run the benchmarks now and compare them to the C++ benchmarks to see where we stand at this point.

@danielfrg
Contributor Author

All done, except for the one question I left about --quick: I would like to keep it, but we could probably refactor it into a --config option that allows multiple benchmark configurations for different scenarios.
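For concreteness, generalizing --quick into named presets could look something like the sketch below; the preset names, flag semantics, and element ranges are made up for illustration, not what the scripts in this PR do:

```python
import argparse

# Hypothetical --config presets: each name maps to the element-count axis
# the benchmark runner would use. "quick" is a small smoke run; "full"
# mirrors the 2^16..2^28 ranges seen in the tables in this thread.
PRESETS = {
    "quick": [2**16, 2**20],
    "full": [2**p for p in range(16, 29, 4)],  # 2^16, 2^20, 2^24, 2^28
}

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="run cuda.compute benchmarks")
    parser.add_argument(
        "--config",
        choices=sorted(PRESETS),
        default="full",
        help="named benchmark configuration preset",
    )
    return parser.parse_args(argv)

args = parse_args(["--config", "quick"])
print(PRESETS[args.config])  # [65536, 1048576]
```

A --quick flag could then remain as a thin alias for `--config quick` if backward compatibility matters.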

@danielfrg
Contributor Author

OK, since you want to keep the --quick option, we should be mostly good to go now.

These are the latest results on my dev machine:

(screenshot of latest benchmark results: CleanShot 2026-03-19 at 11 17 02@2x)

results.zip

I'll open issues to track the performance of the transforms where we do not match C++ at the moment.

@NaderAlAwar
Contributor

Great work @danielfrg! Could you also open issues for segmented_sort/keys and histogram/even? Histogram being faster in Python is suspicious.

@NaderAlAwar
Contributor

/ok to test eca064b

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 26m: Pass: 100%/48 | Total: 15h 13m | Max: 57m 59s

See results here.



Development

Successfully merging this pull request may close these issues.

Benchmarking cuda.compute
