Is this a duplicate?
Area
cuda.compute (Python)
Is your feature request related to a problem? Please describe.
Follow-up from the new nvbench Python benchmarks and their comparison with the C++ ones: #7341
The Python benchmarks come out ahead of C++: 107% of C++ performance (aggregated numbers).
That is unexpected, so we should look into where the difference comes from.
Describe the solution you'd like
Maybe something does not fully match on the Python side.
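One possible starting point is to break the aggregated number down per benchmark and list the cases where Python comes out faster than C++, since those are the most likely places for a mismatch. The snippet below is only a rough sketch: the function name `find_suspicious`, the dict-based input format, and the 2% tolerance are all assumptions for illustration, not part of nvbench or cuda.compute.

```python
def find_suspicious(cpp_times, py_times, tolerance=0.02):
    """List benchmarks where Python is faster than C++ by more than `tolerance`.

    Both inputs are hypothetical dicts mapping a benchmark name to a mean
    time (same unit on both sides). Returns {name: speedup}, largest first.
    """
    suspicious = {}
    for name, cpp_t in cpp_times.items():
        py_t = py_times.get(name)
        # Skip benchmarks that have no usable Python counterpart.
        if py_t is None or py_t <= 0:
            continue
        speedup = cpp_t / py_t  # > 1.0 means Python was faster
        if speedup > 1.0 + tolerance:
            suspicious[name] = speedup
    # Largest gaps first: these are the ones to inspect for mismatched
    # problem sizes, types, or tuning parameters.
    return dict(sorted(suspicious.items(), key=lambda kv: kv[1], reverse=True))
```

Benchmarks that show up here would be the first ones to compare side by side against their C++ counterparts.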
Describe alternatives you've considered
No response
Additional context
No response