Make Parquet SBBF serialize/deserialize helpers public for external reuse #8762
alamb merged 11 commits into apache:main
alamb left a comment
Thank you @RoseZhang123 -- this is a very nice contribution
I think we need a doc example to document and illustrate the use case, but otherwise this is great.
alamb left a comment
Thank you @RoseZhang123 -- this is looking quite close
I have some comments / suggestions about the API. Let me know if it doesn't make sense.
parquet/src/bloom_filter/mod.rs
Outdated
}

/// Returns the raw bitset bytes encoded in little-endian order.
pub fn as_slice(&self) -> Vec<u8> {
Typically, methods whose names start with `as_` do not copy data.
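The naming convention behind this comment can be illustrated with a minimal sketch. The `Bitset` type and its methods below are hypothetical stand-ins invented for illustration, not the parquet crate's types: an `as_*` method borrows, a `to_*` method allocates a copy, and an `into_*` method consumes the value.

```rust
/// Hypothetical stand-in for a byte-backed bitset; not the parquet `Sbbf` type.
struct Bitset {
    bytes: Vec<u8>,
}

impl Bitset {
    /// `as_*`: a cheap borrowed view, no allocation and no copy.
    fn as_slice(&self) -> &[u8] {
        &self.bytes
    }

    /// `to_*`: allocates a new buffer and copies the data into it.
    fn to_vec(&self) -> Vec<u8> {
        self.bytes.clone()
    }

    /// `into_*`: consumes `self` and hands over the existing buffer.
    fn into_bytes(self) -> Vec<u8> {
        self.bytes
    }
}

fn main() {
    let bitset = Bitset { bytes: vec![0x01, 0x02, 0x03, 0x04] };

    let borrowed = bitset.as_slice();
    let copied = bitset.to_vec();

    // Same contents either way.
    assert_eq!(borrowed, copied.as_slice());
    // The borrowed view points at the original buffer; the copy does not.
    assert_eq!(borrowed.as_ptr(), bitset.bytes.as_ptr());
    assert_ne!(copied.as_ptr(), bitset.bytes.as_ptr());

    // Consuming the bitset moves the buffer out without copying it.
    let owned = bitset.into_bytes();
    assert_eq!(owned, vec![0x01, 0x02, 0x03, 0x04]);
}
```

Under this convention, a method that returns a freshly allocated `Vec<u8>` would more naturally carry a `to_*` name, which is one way to read the suggestion.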
@alamb Checking to see if you have any more comments/questions on this PR?
Hi @RoseZhang123 -- I had a few suggestions on the API here: #8762 (review). If that doesn't make sense, let me know and I can try to propose some changes.
b65d207 to c4de23e
etseidl left a comment
Thanks @ODukhno (and @RoseZhang123). I think I understand the use case now. Just a few documentation nits, but nothing blocking.
parquet/src/bloom_filter/mod.rs
Outdated
read_bloom_filter_header_and_length_from_bytes(buffer.as_ref())
}

/// given a byte slice, try to read out a bloom filter header and return both the header and
Suggested change:
- /// given a byte slice, try to read out a bloom filter header and return both the header and
+ /// Given a byte slice, try to read out a bloom filter header and return both the header and
parquet/src/bloom_filter/mod.rs
Outdated
/// flush the writer in order to boost performance of bulk writing all blocks. Caller
/// must remember to flush the writer.
pub(crate) fn write<W: Write>(&self, mut writer: W) -> Result<(), ParquetError> {
/// This method usually is used in conjunction with from_bytes for serialization/deserialization.
Suggested change:
- /// This method usually is used in conjunction with from_bytes for serialization/deserialization.
+ /// This method usually is used in conjunction with [`Self::from_bytes`] for serialization/deserialization.
parquet/src/bloom_filter/mod.rs
Outdated
self.0.capacity() * std::mem::size_of::<Block>()
}

/// reads a Sbff from thrift encoded bytes
Suggested change:
- /// reads a Sbff from thrift encoded bytes
+ /// Reads a Sbff from Thrift encoded bytes
parquet/src/bloom_filter/mod.rs
Outdated
// Note: bloom filters can have false positives, but should never have false negatives
// So we can't assert !check(), but we should verify inserted values are found
let _ = reconstructed.check(value); // Just exercise the code path
Since we can't verify the negative test, why not just move these comments up and skip this loop?
Just pushed another change where I applied this one along with all other suggestions above.
Thanks!
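The property discussed in this thread (false positives are possible, false negatives are not) can be sketched with a toy filter. `ToyBloom` below is a hypothetical illustration built on the standard library's `DefaultHasher`; it is not the split block layout that Parquet uses, and the two-seed hashing scheme is an assumption of this sketch.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter for illustration only; not Parquet's split block bloom filter.
struct ToyBloom {
    bits: Vec<bool>,
}

impl ToyBloom {
    fn new(num_bits: usize) -> Self {
        Self { bits: vec![false; num_bits] }
    }

    /// Derives two bit positions by hashing the value together with two different seeds.
    fn positions<T: Hash>(&self, value: &T) -> [usize; 2] {
        let mut out = [0usize; 2];
        for (seed, slot) in out.iter_mut().enumerate() {
            let mut hasher = DefaultHasher::new();
            (seed as u64).hash(&mut hasher);
            value.hash(&mut hasher);
            *slot = (hasher.finish() % self.bits.len() as u64) as usize;
        }
        out
    }

    /// Sets every bit position derived from the value.
    fn insert<T: Hash>(&mut self, value: &T) {
        for pos in self.positions(value) {
            self.bits[pos] = true;
        }
    }

    /// Returns true only if every derived bit is set.
    fn check<T: Hash>(&self, value: &T) -> bool {
        self.positions(value).iter().all(|&pos| self.bits[pos])
    }
}

fn main() {
    let mut filter = ToyBloom::new(1024);
    let inserted = ["alpha", "beta", "gamma"];

    for value in &inserted {
        filter.insert(value);
    }

    // Safe to assert: an inserted value always has all of its bits set.
    for value in &inserted {
        assert!(filter.check(value));
    }

    // Not safe to assert either way: an absent value may collide with set bits.
    let _maybe_present = filter.check(&"never inserted");
}
```

Because bits are only ever set and never cleared, the positive assertions hold for any input, while a negative assertion could fail on a hash collision. That is why a test can only verify that inserted values are found.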
Looks good to me -- thank you @RoseZhang123 and @ODukhno
Since @etseidl has already approved this too, I'll just merge it in. Thanks again!
Which issue does this PR close?
Rationale for this change
Explained in the issue #8727 .
What changes are included in this PR?
Make the following method signatures public:
Are these changes tested?
Added unit tests for them.
Are there any user-facing changes?
Users are now able to deserialize SBBFs straight from storage and re-serialize them from raw bytes.
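As a rough illustration of the write/`from_bytes` pairing discussed in the review, here is a minimal round-trip sketch. `ToyFilter` is a hypothetical stand-in, not the parquet crate's `Sbbf`. It assumes blocks of eight 32-bit words written in little-endian order, and it omits the Thrift header that the real format carries. As with the reviewed `write` method, the sketch does not flush the writer.

```rust
use std::io::{self, Write};

const WORDS_PER_BLOCK: usize = 8;
const BLOCK_BYTES: usize = WORDS_PER_BLOCK * 4;

/// Hypothetical stand-in for a block-based filter; not the parquet `Sbbf` type.
#[derive(Debug, PartialEq)]
struct ToyFilter {
    blocks: Vec<[u32; WORDS_PER_BLOCK]>,
}

impl ToyFilter {
    /// Writes every block as little-endian words. Does not flush the writer.
    fn write<W: Write>(&self, mut writer: W) -> io::Result<()> {
        for block in &self.blocks {
            for word in block {
                writer.write_all(&word.to_le_bytes())?;
            }
        }
        Ok(())
    }

    /// Rebuilds a filter from bytes previously produced by `write`.
    fn from_bytes(bytes: &[u8]) -> Result<Self, String> {
        if bytes.len() % BLOCK_BYTES != 0 {
            return Err(format!(
                "length {} is not a multiple of the block size {}",
                bytes.len(),
                BLOCK_BYTES
            ));
        }
        let blocks = bytes
            .chunks_exact(BLOCK_BYTES)
            .map(|chunk| {
                let mut block = [0u32; WORDS_PER_BLOCK];
                for (word, raw) in block.iter_mut().zip(chunk.chunks_exact(4)) {
                    *word = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]);
                }
                block
            })
            .collect();
        Ok(Self { blocks })
    }
}

fn main() {
    let original = ToyFilter {
        blocks: vec![[1, 2, 3, 4, 5, 6, 7, 8], [0xFFFF_FFFF; WORDS_PER_BLOCK]],
    };

    // Serialize into an in-memory buffer, standing in for external storage.
    let mut buffer = Vec::new();
    original.write(&mut buffer).expect("writing to a Vec does not fail");

    // Deserialize and confirm that the round trip preserves every block.
    let restored = ToyFilter::from_bytes(&buffer).expect("buffer length is block aligned");
    assert_eq!(original, restored);
}
```

The same shape should apply when the buffer comes from a file or an object store instead of a `Vec<u8>`. Only the source of the bytes changes.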