
bench: Migrate Python cuda.compute benchmarks to nvbench#7341

Merged
gevtushenko merged 20 commits into main from cuda-compute-nvbench
Mar 24, 2026

Conversation

@danielfrg
Contributor

Description

closes #7317

Migrating the Python cuda.compute benchmarks from pytest to nvbench.

  • Adds equivalent C++ benchmarks to compare Python vs C++

@danielfrg danielfrg requested review from a team as code owners January 23, 2026 21:54
@danielfrg danielfrg requested a review from gonidelis January 23, 2026 21:54
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 23, 2026
@copy-pr-bot
Contributor

copy-pr-bot bot commented Jan 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Jan 23, 2026
@danielfrg
Contributor Author

@shwina I migrated one benchmark to nvbench here and added a C++ equivalent. If you can review it and confirm this is what you had in mind, I can continue with the rest.

I added some instructions and wired up the C++ part with CMake, but let me know if you have other ideas on how to make that part better or easier.

On my dev GPU (4090) I got:

# bench_unary_transform_pointer

GPU Time: Mean GPU execution time (cold start, pure kernel)
  CUDA events (nvbench tag: nv/cold/time/gpu/mean)
CPU Time: Mean CPU (host) latency
  Host clock (nvbench tag: nv/cold/time/cpu/mean)

## [0] NVIDIA GeForce RTX 4090

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^1        | 3.097 us   | 3.392 us   | 9.51%      | 15.209 us  | 22.101 us  | 6.892 us   |
| 2^10       | 3.126 us   | 4.525 us   | 44.72%     | 15.202 us  | 23.571 us  | 8.369 us   |
| 2^14       | 3.307 us   | 3.607 us   | 9.07%      | 15.399 us  | 22.187 us  | 6.788 us   |
| 2^17       | 4.902 us   | 6.229 us   | 27.08%     | 16.972 us  | 25.270 us  | 8.298 us   |
| 2^20       | 12.139 us  | 12.571 us  | 3.56%      | 24.200 us  | 30.918 us  | 6.718 us   |
| 2^24       | 149.988 us | 150.786 us | 0.53%      | 162.432 us | 169.373 us | 6.941 us   |
| 2^26       | 586.332 us | 584.699 us | -0.28%     | 598.789 us | 603.301 us | 4.511 us   |
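For reference, the derived columns in these tables can be computed from the per-run means like so (the helper names are mine, not necessarily what the summary script uses):

```python
# Sketch of how the "% Slower" and "CPU Ovhd" columns are derived from the
# mean GPU/CPU times reported by nvbench. Helper names are hypothetical.

def pct_slower(cpp_us: float, py_us: float) -> float:
    """Relative Python slowdown vs C++, in percent (negative = Python faster)."""
    return (py_us - cpp_us) / cpp_us * 100.0

def cpu_overhead(cpp_cpu_us: float, py_cpu_us: float) -> float:
    """Extra host-side latency of the Python dispatch path, in microseconds."""
    return py_cpu_us - cpp_cpu_us

# The 2^20 row from the table above:
print(f"{pct_slower(12.139, 12.571):.2f}%")      # 3.56%
print(f"{cpu_overhead(24.200, 30.918):.3f} us")  # 6.718 us
```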

@shwina
Contributor

shwina commented Jan 25, 2026

Thanks @danielfrg - I think this is a really great start!

With regards to comparing C++ and Python benchmarks, I wonder if we can't use the existing nvbench_compare script - https://github.com/NVIDIA/nvbench/blob/main/scripts/nvbench_compare.py? IIRC @NaderAlAwar has had some previous experience doing this, so it would be great to have his feedback as well.
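The summary tags quoted earlier in this thread (`nv/cold/time/gpu/mean`, `nv/cold/time/cpu/mean`) live inside nvbench's JSON output. A rough sketch of extracting one, using a deliberately simplified inline document (the real files nest state and summary data more deeply, so inspect an actual `--json` result before relying on this shape):

```python
import json

# Simplified stand-in for an nvbench JSON result file; the real layout is
# richer, but the tag-based lookup idea is the same.
SIMPLIFIED_RESULT = json.loads("""
{
  "benchmarks": [
    {
      "name": "bench_unary_transform_pointer",
      "states": [
        {
          "name": "Elements=2^20",
          "summaries": [
            {"tag": "nv/cold/time/gpu/mean", "value": 12.571e-06},
            {"tag": "nv/cold/time/cpu/mean", "value": 30.918e-06}
          ]
        }
      ]
    }
  ]
}
""")

def mean_gpu_seconds(result: dict, bench_name: str, state_name: str) -> float:
    """Look up a state's mean cold GPU time (seconds) by its summary tag."""
    for bench in result["benchmarks"]:
        if bench["name"] != bench_name:
            continue
        for state in bench["states"]:
            if state["name"] != state_name:
                continue
            for summary in state["summaries"]:
                if summary["tag"] == "nv/cold/time/gpu/mean":
                    return summary["value"]
    raise KeyError(f"{bench_name}/{state_name} not found")

t = mean_gpu_seconds(SIMPLIFIED_RESULT, "bench_unary_transform_pointer", "Elements=2^20")
print(f"{t * 1e6:.3f} us")  # 12.571 us
```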

Contributor

@NaderAlAwar NaderAlAwar left a comment


Thanks for working on this @danielfrg! The transform benchmark looks great. Two points:

  1. As @shwina pointed out, the comparison script (and most of the other analysis scripts) already exists in nvbench, so we should not copy them. I am working on releasing a pynvbench wheel and will figure out a way to expose that functionality properly. Once the wheels are released, we should make pynvbench an optional dependency that you can add when you pip install cuda-cccl.
  2. We should not add the C++ version of the benchmark. The CUB benchmarks already exist under cub/benchmarks/bench/, so in my opinion we should try to mimic at least one benchmark from each CUB algorithm so we can compare the two properly.
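One way to wire up the optional dependency mentioned in point 1 is a pyproject extra. The extra name and wheel name below are hypothetical placeholders, not the final packaging:

```toml
# Hypothetical cuda-cccl pyproject.toml fragment; the extra name ("bench")
# and the dependency name ("pynvbench") are placeholders until the wheels
# actually ship.
[project.optional-dependencies]
bench = ["pynvbench"]
```

Users would then opt in with something like `pip install cuda-cccl[bench]`.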

I'm curious about python_vs_cpp_summary.py. Can you show me what the output would look like? It doesn't have to be from a real run, just a visualization.

@danielfrg
Contributor Author

  1. Agreed. The nvbench scripts are here for convenience at the moment; if you are able to expose them in the wheels, we can remove them. They work great, but I think the Python-vs-C++ one produces somewhat more convenient reports for this case in general; the summary table in my first comment is from that script.
  2. Makes sense on reusing the existing benchmarks; for this example I was just converting the existing Python pytest one. Do we want to convert the ones we have instead? Let's say starting with this one: https://github.com/NVIDIA/cccl/blob/main/cub/benchmarks/bench/transform/fill.cu ?

@NaderAlAwar
Contributor

I see now; yeah, the python vs cpp summary script should stay. I think it would be more valuable to convert the existing C++ ones, and yes, fill.cu seems like a good start.

@NaderAlAwar
Contributor

Also, take a look at https://github.com/NVIDIA/cccl/blob/main/CONTRIBUTING.md#code-formatting-pre-commit-hooks to fix the pre-commit CI issue.

@danielfrg
Contributor Author

I updated the one benchmark here to be based on fill.cu, and overall the results look good to me: for large data sizes we are within the 5% target:

python analysis/python_vs_cpp_summary.py results/fill_py.json results/fill_cpp.json
# bench_fill

GPU Time: Mean GPU execution time (cold start, pure kernel)
  CUDA events (nvbench tag: nv/cold/time/gpu/mean)
CPU Time: Mean CPU (host) latency
  Host clock (nvbench tag: nv/cold/time/cpu/mean)

## [0] NVIDIA GeForce RTX 4090

### Type: I16

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 2.754 us   | 4.278 us   | 55.35%     | 14.796 us  | 23.269 us  | 8.474 us   |
| 2^20       | 3.532 us   | 4.972 us   | 40.76%     | 15.715 us  | 23.905 us  | 8.190 us   |
| 2^24       | 34.311 us  | 40.369 us  | 17.65%     | 46.510 us  | 58.837 us  | 12.327 us  |
| 2^28       | 558.510 us | 565.441 us | 1.24%      | 570.950 us | 584.637 us | 13.687 us  |
| 2^32       | 8.952 ms   | 8.958 ms   | 0.06%      | 8.964 ms   | 8.978 ms   | 13.978 us  |

### Type: I32

| Elements   | C++ GPU   | Py GPU    | % Slower   | C++ CPU   | Py CPU    | CPU Ovhd   |
|------------|-----------|-----------|------------|-----------|-----------|------------|
| 2^16       | 2.861 us  | 3.365 us  | 17.64%     | 14.940 us | 21.710 us | 6.770 us   |
| 2^20       | 4.511 us  | 4.837 us  | 7.24%      | 16.391 us | 23.517 us | 7.127 us   |
| 2^24       | 69.059 us | 67.794 us | -1.83%     | 81.073 us | 85.855 us | 4.782 us   |
| 2^28       | 1.118 ms  | 1.120 ms  | 0.21%      | 1.130 ms  | 1.139 ms  | 8.893 us   |

### Type: I64

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 2.881 us   | 4.293 us   | 48.99%     | 14.974 us  | 23.148 us  | 8.174 us   |
| 2^20       | 8.536 us   | 9.735 us   | 14.04%     | 20.425 us  | 28.462 us  | 8.037 us   |
| 2^24       | 140.137 us | 140.560 us | 0.30%      | 152.051 us | 159.512 us | 7.460 us   |
| 2^28       | 2.239 ms   | 2.239 ms   | -0.00%     | 2.252 ms   | 2.259 ms   | 7.814 us   |

### Type: I8

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 2.727 us   | 3.350 us   | 22.85%     | 14.533 us  | 21.981 us  | 7.448 us   |
| 2^20       | 3.160 us   | 3.842 us   | 21.59%     | 15.063 us  | 22.308 us  | 7.245 us   |
| 2^24       | 17.222 us  | 22.981 us  | 33.45%     | 29.363 us  | 41.180 us  | 11.817 us  |
| 2^28       | 278.926 us | 285.592 us | 2.39%      | 291.525 us | 304.250 us | 12.725 us  |
| 2^32       | 4.476 ms   | 4.482 ms   | 0.14%      | 4.488 ms   | 4.502 ms   | 13.767 us  |

I had Claude analyze the APIs and give me a mapping for the transform benchmarks:

| CUB Benchmark | CUB API | Python API | Status | Notes |
|---|---|---|---|---|
| transform/fill.cu | cub::detail::transform::dispatch with return_constant | unary_transform + ConstantIterator | DONE | Using ConstantIterator(42) as a 0-ary generator |
| transform/babelstream.cu | cub::detail::transform::dispatch (mul, add, triad, nstream) | unary_transform, binary_transform | Ready | Multiple operations: a\*scalar, a+b, a+b\*scalar, a+b+c\*scalar |
| transform/heavy.cu | cub::detail::transform::dispatch with heavy compute | unary_transform | Ready | Custom function with configurable compute intensity |
| transform/fib.cu | cub::detail::transform::dispatch | unary_transform | Ready | Fibonacci computation |
| transform/grayscale.cu | cub::detail::transform::dispatch with struct input | unary_transform + @gpu_struct | Ready | RGB struct → grayscale; needs gpu_struct for the RGB type |
| transform/complex_cmp.cu | cub::detail::transform::dispatch | unary_transform | Ready | Complex comparison logic |
| transform_reduce/sum.cu | cub::detail::reduce::dispatch with transform | reduce_into with TransformIterator | ⚠️ Workaround | Needs TransformIterator or a pre-transform pass |

They look good to me at a quick glance. Let me know what you all prefer in terms of PRs: do we want to keep them all in a single one? If the structure looks good now, I will start on more transform benchmarks.

@danielfrg
Contributor Author

Changed the C++ build script to use the one in the ci folder and added a single script to run the benchmarks.

I added two extra benchmarks to show how that works. The babelstream one looks OK on my GPU, but the transform_heavy one shows big performance differences:

# heavy

GPU Time: Mean GPU execution time (cold start, pure kernel)
  CUDA events (nvbench tag: nv/cold/time/gpu/mean)
CPU Time: Mean CPU (host) latency
  Host clock (nvbench tag: nv/cold/time/cpu/mean)

## [0] NVIDIA GeForce RTX 4090

### Heaviness=128

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 6.171 us   | 12.469 us  | 102.07%    | 18.244 us  | 28.769 us  | 10.525 us  |
| 2^20       | 48.865 us  | 145.804 us | 198.38%    | 60.683 us  | 162.066 us | 101.383 us |
| 2^24       | 613.619 us | 2.187 ms   | 256.37%    | 626.126 us | 2.204 ms   | 1.578 ms   |
| 2^28       | 9.894 ms   | 34.658 ms  | 250.30%    | 9.906 ms   | 34.677 ms  | 24.770 ms  |

### Heaviness=256

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 9.700 us   | 384.247 us | 3861.21%   | 21.617 us  | 400.758 us | 379.141 us |
| 2^20       | 105.618 us | 9.129 ms   | 8543.15%   | 117.345 us | 9.147 ms   | 9.029 ms   |
| 2^24       | 1.541 ms   | 149.250 ms | 9586.40%   | 1.553 ms   | 149.271 ms | 147.718 ms |
| 2^28       | 24.399 ms  | 2.416 s    | 9800.04%   | 24.412 ms  | 2.416 s    | 2.391 s    |

### Heaviness=32

| Elements   | C++ GPU    | Py GPU     | % Slower   | C++ CPU    | Py CPU     | CPU Ovhd   |
|------------|------------|------------|------------|------------|------------|------------|
| 2^16       | 4.316 us   | 5.651 us   | 30.94%     | 16.286 us  | 22.287 us  | 6.001 us   |
| 2^20       | 15.354 us  | 36.968 us  | 140.77%    | 27.240 us  | 53.072 us  | 25.832 us  |
| 2^24       | 174.735 us | 495.926 us | 183.82%    | 186.898 us | 512.593 us | 325.695 us |
| 2^28       | 2.889 ms   | 7.961 ms   | 175.57%    | 2.902 ms   | 7.979 ms   | 5.077 ms   |

### Heaviness=64

| Elements   | C++ GPU    | Py GPU    | % Slower   | C++ CPU    | Py CPU    | CPU Ovhd   |
|------------|------------|-----------|------------|------------|-----------|------------|
| 2^16       | 4.989 us   | 7.757 us  | 55.49%     | 16.932 us  | 24.156 us | 7.224 us   |
| 2^20       | 25.499 us  | 76.779 us | 201.10%    | 37.433 us  | 92.815 us | 55.382 us  |
| 2^24       | 317.748 us | 1.005 ms  | 216.32%    | 329.875 us | 1.022 ms  | 692.030 us |
| 2^28       | 5.204 ms   | 15.931 ms | 206.15%    | 5.216 ms   | 15.949 ms | 10.733 ms  |


Comparison saved to: /home/danielfrg/Documents/cccl/python/cuda_cccl/benchmarks/compute/nvbench/results/heavy_comparison.txt

Any ideas on why this might be happening? My guess would be the numba kernels, or the compilation of those.
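One observation from the tables above: the Python/C++ GPU-time ratio grows with Heaviness rather than staying a fixed overhead, which would be consistent with the generated kernel being less efficient per iteration. A quick sketch using the 2^24 rows posted above:

```python
# Python/C++ mean GPU time ratios at 2^24 elements, taken from the tables
# above, to show the gap scales with Heaviness instead of being a constant
# dispatch cost.
gpu_mean_us = {  # heaviness -> (C++ GPU us, Py GPU us)
    32: (174.735, 495.926),
    64: (317.748, 1005.0),
    128: (613.619, 2187.0),
    256: (1541.0, 149250.0),
}

for heaviness, (cpp, py) in sorted(gpu_mean_us.items()):
    print(f"Heaviness={heaviness:>3}: Python is {py / cpp:.1f}x slower")
```

On these numbers the ratios come out to roughly 2.8x, 3.2x, 3.6x, and then a jump to 96.9x at Heaviness=256, which is why the 256 case looks like a codegen cliff rather than gradual degradation.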

Contributor

@NaderAlAwar NaderAlAwar left a comment


This looks great! Thanks Daniel!

Comment on lines +21 to +23
- cuda-bench[cu12]>=0.2.0
- cuda-cccl[cu12]
- cupy-cuda12x
Contributor


Question: is there a way to not hardcode this to cu12?

Contributor Author


Not 100% sure, but I don't think so; my plan was to have another environment.yml for CUDA 13.

@NaderAlAwar
Contributor

pre-commit.ci autofix

@NaderAlAwar
Contributor

> Any ideas on why this might be happening? My guess would be mostly on the numba kernels or compilation of those?

These results are interesting; we should definitely look into them. I would assume it is something in numba, but I would have to look at the SASS to say for sure. In any case, this should not block the PR. Can you create an issue noting this? Let's maybe have an epic issue comparing cuda.compute performance to CUB, and make this specific instance a sub-issue.

@danielfrg
Contributor Author

Agreed, it should not block this one. I will create an epic issue to analyze the results.
I'll continue adding the other transform benchmarks in this PR unless you all prefer something else.

@NaderAlAwar
Contributor

It's up to you. We can merge this now, or you can add the other benchmarks first.

@danielfrg
Contributor Author

Ok, then I will add a couple more here!

@danielfrg
Contributor Author

I think I addressed all the previous comments.

Contributor

@NaderAlAwar NaderAlAwar left a comment


The benchmarks look great now; I left a few comments to clean things up. It would be good to run the benchmarks now and compare them to the C++ benchmarks to see where we stand at this point.

@danielfrg
Contributor Author

All done, except for the one question I left about --quick: I would like to keep it, but we could probably refactor it into a --config option that allows multiple benchmark configurations for different scenarios.
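For concreteness, generalizing --quick into named presets could look something like the sketch below; the preset names, flag semantics, and element ranges are made up for illustration, not what the scripts in this PR do:

```python
import argparse

# Hypothetical --config presets: each name maps to the element-count axis
# the benchmark runner would use. "quick" is a small smoke run; "full"
# mirrors the 2^16..2^28 ranges seen in the tables in this thread.
PRESETS = {
    "quick": [2**16, 2**20],
    "full": [2**p for p in range(16, 29, 4)],  # 2^16, 2^20, 2^24, 2^28
}

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="run cuda.compute benchmarks")
    parser.add_argument(
        "--config",
        choices=sorted(PRESETS),
        default="full",
        help="named benchmark configuration preset",
    )
    return parser.parse_args(argv)

args = parse_args(["--config", "quick"])
print(PRESETS[args.config])  # [65536, 1048576]
```

A --quick flag could then remain as a thin alias for `--config quick` if backward compatibility matters.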

@danielfrg
Contributor Author

OK, since you want to keep the --quick option, we should be mostly good to go now.

These are the latest results on my dev machine:

(screenshot of latest benchmark results: CleanShot 2026-03-19 at 11 17 02@2x)

results.zip

I'll open issues to track the performance of the transforms where we do not match C++ at the moment.

@NaderAlAwar
Contributor

Great work @danielfrg! Could you also open issues for segmented_sort/keys and histogram/even? Histogram being faster in Python is suspicious.

@NaderAlAwar
Contributor

/ok to test eca064b

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 26m: Pass: 100%/48 | Total: 15h 13m | Max: 57m 59s

See results here.



Development

Successfully merging this pull request may close these issues.

Benchmarking cuda.compute
