ARROW-10040: [Rust] Add slice that realigns Buffer #8223
nevi-me wants to merge 2 commits into apache:master
Has the consequence of removing the alignment limit on bool kernels. However, it comes at the cost of slower buffer manipulation.
@jhorstmann @paddyhoran this is related to the alignment fixes made recently. While reviewing another PR, I noticed that we had a limitation on boolean kernels if offsets weren't a multiple of 8. So I've implemented a slice on I expect there to be a minor perf impact, and I only use the above slice method when necessary. It's very likely that my implementation can be improved upon, but I'm at my current limits on bit manipulation. Please see https://godbolt.org/z/bMc5qd for some of the assembly generated. I'm not happy with the instruction count and would appreciate some guidance or improvements. Feel free to push any changes directly on this branch.
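For readers following along, the realigning slice described above can be pictured roughly as follows. This is only a minimal sketch, not the code in this PR: the function name `realign_bits` and its signature are hypothetical, it works on a plain `&[u8]` rather than an Arrow `Buffer`, and it assumes Arrow's least-significant-bit-first bit numbering within each byte.

```rust
/// Hypothetical helper (not part of the arrow crate): copies `bit_len` bits
/// starting at `bit_offset` into a new byte vector so that the first copied
/// bit lands at bit 0 of the output. Bits are numbered LSB-first per byte.
pub fn realign_bits(bytes: &[u8], bit_offset: usize, bit_len: usize) -> Vec<u8> {
    assert!(bit_offset + bit_len <= bytes.len() * 8, "slice out of bounds");

    let byte_offset = bit_offset / 8;
    let shift = (bit_offset % 8) as u32;
    let out_len = (bit_len + 7) / 8;
    let mut out = Vec::with_capacity(out_len);

    for i in 0..out_len {
        let lo = bytes[byte_offset + i];
        if shift == 0 {
            // Already byte-aligned: a plain copy is enough.
            out.push(lo);
        } else {
            // Combine the upper bits of this byte with the lower bits of the
            // next one; reading past the end is treated as zero.
            let hi = bytes.get(byte_offset + i + 1).copied().unwrap_or(0);
            out.push((lo >> shift) | (hi << (8 - shift)));
        }
    }

    // Clear any bits beyond `bit_len` in the last output byte.
    let rem = bit_len % 8;
    if rem != 0 {
        if let Some(last) = out.last_mut() {
            *last &= (1u8 << rem) - 1;
        }
    }
    out
}
```

The per-byte shift-and-or is the extra work that an aligned slice avoids, which is presumably where the slower buffer manipulation comes from. For example, `realign_bits(&[0b1010_1100, 0b0000_1111], 3, 8)` yields the single byte `0xF5`.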
Hi @nevi-me, I guess I have to thank you for pushing this topic. I played around today with a different approach that uses an iterator over the validity mask and returns bits in chunks of u8/u16/u32/u64. This can then be used together with a chunked iterator over the typed data of the buffer to efficiently implement kernels. So far it's more of a proof of concept because it doesn't handle the remainder bits of this chunking yet. It's also largely untested. The generated code for a kernel looks very nice: the inner loop gets unrolled and it even uses AVX-512 mask registers if available: https://rust.godbolt.org/z/qGcrW8 This approach could also be used with explicit vector instructions in the inner loop and a scalar loop for the remainder, which ensures no out-of-bounds reads if the chunk size matches the vector register size. The big todos are the remainder part and a lot of tests.
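To make the chunked-iterator idea concrete, here is a rough sketch of how such an iterator and a kernel built on it might look. It is not the code from the proof of concept or from arrow: the names `BitChunks`, `read_bits_u64` and `sum_valid` are hypothetical, only `u64` chunks are covered, and the remainder is handled with a simple scalar loop. It again assumes LSB-first bit numbering.

```rust
/// Hypothetical iterator: yields full 64-bit chunks of a bitmap, starting at
/// an arbitrary bit offset. Remainder bits are not yielded.
pub struct BitChunks<'a> {
    bytes: &'a [u8],
    bit_pos: usize,
    remaining_chunks: usize,
}

impl<'a> BitChunks<'a> {
    pub fn new(bytes: &'a [u8], bit_offset: usize, bit_len: usize) -> Self {
        assert!(bit_offset + bit_len <= bytes.len() * 8, "bitmap too short");
        BitChunks { bytes, bit_pos: bit_offset, remaining_chunks: bit_len / 64 }
    }
}

/// Reads 64 bits starting at bit `pos`. Up to 9 bytes are touched; anything
/// past the end of `bytes` is treated as zero.
fn read_bits_u64(bytes: &[u8], pos: usize) -> u64 {
    let byte = pos / 8;
    let shift = (pos % 8) as u32;
    let end = (byte + 9).min(bytes.len());
    let mut buf = [0u8; 16];
    buf[..end - byte].copy_from_slice(&bytes[byte..end]);
    (u128::from_le_bytes(buf) >> shift) as u64
}

impl<'a> Iterator for BitChunks<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.remaining_chunks == 0 {
            return None;
        }
        let chunk = read_bits_u64(self.bytes, self.bit_pos);
        self.bit_pos += 64;
        self.remaining_chunks -= 1;
        Some(chunk)
    }
}

/// Example kernel: sums the values whose validity bit is set.
pub fn sum_valid(values: &[i64], validity: &[u8], bit_offset: usize) -> i64 {
    let mut sum = 0i64;

    // Main loop: one 64-bit mask per 64 values.
    let masks = BitChunks::new(validity, bit_offset, values.len());
    for (mask, vals) in masks.zip(values.chunks_exact(64)) {
        for (i, v) in vals.iter().enumerate() {
            if (mask >> i) & 1 == 1 {
                sum += *v;
            }
        }
    }

    // Scalar loop for the remainder that does not fill a whole chunk.
    let done = values.len() / 64 * 64;
    for i in done..values.len() {
        let bit = bit_offset + i;
        if (validity[bit / 8] >> (bit % 8)) & 1 == 1 {
            sum += values[i];
        }
    }
    sum
}
```

Because each mask is an ordinary integer, the inner loop has a fixed trip count of 64, which is the kind of shape a compiler can unroll or vectorize. Whether it actually does so for this sketch has not been checked here.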
My proof of concept is available at https://github.com/jhorstmann/bititer-poc/blob/master/src/lib.rs but I'm not sure when I will have time to integrate it with arrow.
@jhorstmann can I close this PR, and rely on your implementation when ready? Also, do you think we'd be able to use your implementation in
@nevi-me can you point me to the part of the parquet code that you have in mind? I found the
Hey @jhorstmann, I haven't had time to look, but maybe I'm confused. What I recall is that I needed a way of converting a Buffer to a I'm closing this PR |
…itrary offsets

@nevi-me this is the chunked iterator based approach I mentioned in #8223

I'm not fully satisfied with the solution yet:
- I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`.
- I need to do some benchmarking about how much the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code.

Closes #8262 from jhorstmann/ARROW-10040-unaligned-bit-buffers

Authored-by: Jörn Horstmann <joern.horstmann@signavio.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>