Is this a duplicate?
Area
cuda.compute (Python)
Is your feature request related to a problem? Please describe.
Follow-up to the new nvbench Python benchmarks and their comparison with the C++ ones: #7341
We are currently getting about 90% of the C++ performance on the transform/heavy benchmark.
Describe the solution you'd like
Our target is a performance gap of no more than 5% relative to C++ for large numbers of items.
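To make the target concrete, here is a minimal sketch of how the gap could be checked from two measured run times. It assumes "performance" means throughput on the same workload, so relative performance is the ratio of C++ time to Python time. The helper names and the timings are hypothetical, not taken from the nvbench results.

```python
def performance_gap(cpp_time_s: float, python_time_s: float) -> float:
    """Return the fraction of C++ throughput lost by the Python path.

    Assumes both timings cover the same workload, so the throughput
    ratio equals cpp_time_s / python_time_s.
    """
    if cpp_time_s <= 0 or python_time_s <= 0:
        raise ValueError("timings must be positive")
    return 1.0 - cpp_time_s / python_time_s


def meets_target(cpp_time_s: float, python_time_s: float,
                 max_gap: float = 0.05) -> bool:
    """True if the Python path is within max_gap of C++ performance."""
    return performance_gap(cpp_time_s, python_time_s) <= max_gap


# Hypothetical timings (seconds), chosen only to illustrate the arithmetic.
# Python at about 90% of C++ performance: roughly a 10% gap, misses the target.
current_ok = meets_target(cpp_time_s=0.90, python_time_s=1.00)
# Python at about 96% of C++ performance: roughly a 4% gap, meets the target.
goal_ok = meets_target(cpp_time_s=0.96, python_time_s=1.00)
```

With this definition, the current state (about 90% of C++) corresponds to a gap of about 10%, so the request amounts to roughly halving that gap.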
Describe alternatives you've considered
No response
Additional context
No response