We had a conversation on IRC with @tomsmeding about how it would be great to have different strategies available for executing the permute array scatters. Especially in cases where the "scatter" actually compresses the array (histogram-style use), and on GPUs, the locks may hurt quite a lot.
A canonical way to remove (or reduce) the locking is to split the problem into a few smaller permutes that scatter the data into a few "privatized" arrays, and then do a simple reduction over those afterwards.
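For concreteness, here is roughly how a privatized histogram could look. This is only my sketch, assuming accelerate >= 1.3 where the permute index mapping returns a Maybe index; `histogramPriv` and the `copies` parameter are just illustrative names, not existing API:

```haskell
import qualified Data.Array.Accelerate as A

-- Plain histogram: every element that lands in the same bin contends for
-- the same per-element lock inside the generated permute kernel.
histogram :: Int -> A.Acc (A.Vector Int) -> A.Acc (A.Vector Int)
histogram bins xs =
  A.permute (+)
    (A.fill (A.I1 (A.constant bins)) 0)
    (\ix -> A.Just_ (A.I1 (xs A.! ix)))
    (A.fill (A.shape xs) 1)

-- Privatized variant: element i scatters into sub-histogram (i `mod` copies),
-- so fewer writers share any one bin; the copies are summed afterwards.
histogramPriv :: Int -> Int -> A.Acc (A.Vector Int) -> A.Acc (A.Vector Int)
histogramPriv copies bins xs =
  let priv = A.permute (+)
               (A.fill (A.I2 (A.constant copies) (A.constant bins)) 0)
               (\(A.I1 i) -> A.Just_ (A.I2 (i `A.mod` A.constant copies) (xs A.! A.I1 i)))
               (A.fill (A.shape xs) 1)
  in  A.fold (+) 0 (A.transpose priv)   -- reduce away the privatization axis
```

The catch is exactly the tuning problem below: the right number of copies (and how the copies map onto threads/blocks) depends on the hardware, which is why I'd rather control it than have it picked automatically.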
Because the optimal configuration for that scheme is often very unclear (it depends on too many cache-related factors), I wouldn't really like permute to try to solve this automagically, but rather to have some of the internals exposed so that I can tell Accelerate to do the permute in a specific way. E.g., "allocate" the matrices myself, and have the computation separated into proper threads. (Chances are that this can be expressed in current Accelerate without new features, I just don't know how :D )
As the general target, I'd love to support what @krulis-martin did in this paper: https://dl.acm.org/doi/abs/10.1145/3404397.3404426 (if you don't have ACM access, the PDF scihubs well). He used a combination of privatization and locking; see Figs. 8 and 9 in the paper.
Optional challenge/question: Specifically for the k-means use case, the permute looks like it's not scattering single elements but whole blocks of data (vectors of sufficiently chunky dimensions to feed some SIMD); see the sketch below. Is it possible (or necessary?) to tell Accelerate that the locking can be done on a whole SIMD-block level, thus possibly saving some synchronization time?
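To make the question concrete, this is roughly what I mean by the k-means accumulation as a single permute today. It's only my sketch of the use case (not the code from the paper); `centroidSums` and the shapes are made up:

```haskell
import qualified Data.Array.Accelerate as A

-- Accumulate per-centroid sums for k-means with a single permute.
-- Each point is a row of `points`; `assign` gives its centroid index.
-- The index mapping runs per scalar element, so every point takes
-- d separate per-element locks rather than one lock per d-vector.
centroidSums
  :: Int                          -- number of centroids (k)
  -> A.Acc (A.Matrix Double)      -- points, shape Z :. n :. d
  -> A.Acc (A.Vector Int)         -- centroid assignment per point, length n
  -> A.Acc (A.Matrix Double)      -- per-centroid sums, shape Z :. k :. d
centroidSums k points assign =
  let d = A.indexHead (A.shape points)   -- innermost extent: the vector dimension
  in  A.permute (+)
        (A.fill (A.I2 (A.constant k) d) 0)
        (\(A.I2 i j) -> A.Just_ (A.I2 (assign A.! A.I1 i) j))
        points
```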
Thanks for any ideas!
-mk
PS. I've got Accelerate source code for the above, and will hopefully publish it in a few days and link it here so that everyone can see this in action.