Fix performance bugs in scalar reductions #509
magnatelee merged 7 commits into nv-legate:branch-22.10
Conversation
* Use unsigned 64-bit integers instead of signed integers wherever possible; CUDA hasn't added an atomic intrinsic for the latter yet.
* Move reduction buffers from zero-copy memory to framebuffer. This makes the slow atomic update code path in reduction operators run much more efficiently.
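The switch to unsigned 64-bit integers likely rests on the standard workaround for CUDA providing `atomicAdd` for `unsigned long long` but not for signed `long long`: two's-complement addition is bit-identical for both, so the signed value can be reinterpreted as unsigned. A minimal host-side sketch of that trick, using `std::atomic` in place of the device intrinsic (the function name here is illustrative, not from the PR):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// CUDA's atomicAdd has an unsigned long long overload but no signed
// long long one, so a signed 64-bit sum can run on the unsigned bit
// pattern; two's-complement addition gives the same bits either way.
// Host-side stand-in for the device trick, with std::atomic as the
// "only available" atomic.
int64_t atomic_add_signed_via_unsigned(std::atomic<uint64_t>& cell, int64_t value)
{
  uint64_t bits;
  std::memcpy(&bits, &value, sizeof(bits));  // reinterpret, no numeric conversion
  uint64_t old = cell.fetch_add(bits);       // unsigned atomic add
  int64_t result;
  std::memcpy(&result, &old, sizeof(result));
  return result;                             // previous signed value
}
```

On the device the same idea is spelled `atomicAdd(reinterpret_cast<unsigned long long*>(ptr), ...)`.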
Do you want to also replace the use of
CHECK_CUDA(cudaMemcpyAsync(ptr_, &identity, sizeof(LHS), cudaMemcpyHostToDevice, stream));
}
__device__ void operator<<=(const RHS& value) const
Nit: obviously these things come down to preference and this is just matching Legion, but I would personally suggest writing this out as a function name rather than overloading an operator. This appears to be doing an atomic reduce. The reduce_output helper function was a little difficult to parse with the <<= (a bit-shift operator borrowed for a different purpose) instead of a function call that says exactly what the code is doing (result.non_exclusive_fold, e.g.).
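The reviewer's suggestion can be sketched as keeping the Legion-style operator but forwarding it to a named method, so call sites can spell out the intent. This is a hypothetical shape, not the PR's actual class: `SumReduction`, the member names, and the use of `std::atomic` in place of a device-side atomic fold are all stand-ins.

```cpp
#include <atomic>
#include <cassert>

// Illustrative reduction operator: identity and RHS type only.
struct SumReduction {
  using RHS = long long;
  static constexpr RHS identity = 0;
};

template <typename REDOP>
class ScalarReductionBuffer {
 public:
  using RHS = typename REDOP::RHS;

  // Named entry point: an atomic (non-exclusive) fold into the buffer.
  void non_exclusive_fold(const RHS& value) { value_.fetch_add(value); }

  // Operator spelling kept for Legion compatibility; delegates to the
  // named method so only one code path does the work.
  void operator<<=(const RHS& value) { non_exclusive_fold(value); }

  RHS read() const { return value_.load(); }

 private:
  std::atomic<RHS> value_{REDOP::identity};
};
```

With this layout, `result.non_exclusive_fold(v)` and `result <<= v` are interchangeable, and the named form documents itself.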
using RHS = typename REDOP::RHS;

public:
ScalarReductionBuffer(cudaStream_t stream) : buffer_(legate::create_buffer<LHS>(1))
the class name obviously gets annoyingly long, but consider calling this 'DeviceScalarReductionBuffer' to make it clear this is not a general reduction buffer and is only designed for device reductions.
using RHS = typename REDOP::RHS;

public:
ScalarReductionBuffer(cudaStream_t stream) : buffer_(legate::create_buffer<LHS>(1))
Since this is again only going to run on the device, do we want to explicitly pass GPU_FB_MEM to create_buffer to make it clearer what is happening? Otherwise this uses the default kind = NO_MEMKIND, which seems fragile: it relies on get_executing_processor() returning TOC_PROC to allocate this in the right place.
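The fragility the reviewer points at can be shown abstractly: a defaulted "no kind" argument is resolved from runtime context, while an explicit kind pins the decision at the call site. The enum and `resolve` function below are stand-ins for illustration, not Legate's real allocation API.

```cpp
#include <cassert>

// Stand-in for the memory kinds involved (not Legate's actual enum).
enum class MemKind { NO_MEMKIND, SYSTEM_MEM, GPU_FB_MEM, Z_COPY_MEM };

// NO_MEMKIND means "whatever the executing processor prefers", so the
// outcome depends on runtime context; any explicit kind wins outright.
MemKind resolve(MemKind requested, MemKind executing_default)
{
  return requested == MemKind::NO_MEMKIND ? executing_default : requested;
}
```

Passing GPU_FB_MEM explicitly removes the dependence on which processor happens to be executing when the buffer is created.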
* Unify the template for device reduction tree and do some cleanup
* Fix performance bugs in scalar reduction kernels:
  * Use unsigned 64-bit integers instead of signed integers wherever possible; CUDA hasn't added an atomic intrinsic for the latter yet.
  * Move reduction buffers from zero-copy memory to framebuffer. This makes the slow atomic update code path in reduction operators run much more efficiently.
* Use the new scalar reduction buffer in binary reductions as well
* Use only the RHS type in the reduction buffer as we never call apply
* Minor clean up per review
* Rename the buffer class and method to make the intent explicit
* Flip the polarity of reduce's template parameter

Co-authored-by: Wonchan Lee <wonchanl@nvidia.com>
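The "device reduction tree" the first commit unifies follows the standard stride-halving shape of a CUDA block reduction. A host-side sketch of that shape, with a loop standing in for one thread per index and a comment marking where the block barrier would go (`REDOP` and `SumOp` here are illustrative stand-ins, not the PR's templates):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tree-shaped reduction: each round folds element i with element
// i + stride, halving the active range until one value remains.
// Requires a power-of-two input size, as block reductions typically do.
template <typename REDOP, typename T>
T tree_reduce(std::vector<T> vals)
{
  for (std::size_t stride = vals.size() / 2; stride > 0; stride /= 2) {
    for (std::size_t i = 0; i < stride; ++i) {  // on a GPU: one thread per i
      REDOP::fold(vals[i], vals[i + stride]);
    }
    // on a GPU a __syncthreads() barrier would sit here
  }
  return vals[0];
}

// Illustrative reduction operator with the fold(lhs, rhs) shape.
struct SumOp {
  template <typename T>
  static void fold(T& lhs, const T& rhs) { lhs += rhs; }
};
```

In the kernel the per-block result of this tree would then be folded into the scalar reduction buffer with the atomic path the PR moves to framebuffer memory.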