Use BufReader for LocalFileReader to revert performance regression in parquet reading#1366
Conversation
BufRead for ChunkObjectReader to improve performanceBufRead for ChunkObjectReader to revert performance regression in parquet reading
alamb
left a comment
There was a problem hiding this comment.
Thank you @Dandandan -- I am not sure about the need for the BufRead trait, but otherwise looks good to me.
Very nice 🕵️ work
|
|
||
| use std::fs::{self, File, Metadata}; | ||
| use std::io::{Read, Seek, SeekFrom}; | ||
| use std::io::{BufRead, BufReader, Read, Seek, SeekFrom}; |
| file.seek(SeekFrom::Start(start))?; | ||
| Ok(Box::new(file.take(length as u64))) | ||
|
|
||
| let file = BufReader::new(file.take(length as u64)); |
There was a problem hiding this comment.
It seems like this is the actual fix, right? Is the change to require the BufRead trait needed?
There was a problem hiding this comment.
Not sure. Rerunning some benchmarks now without the trait.
There was a problem hiding this comment.
Yes - looks like BufRead wasn't needed 🎉
BufRead for ChunkObjectReader to revert performance regression in parquet readingBufReader for ChunkObjectReader to revert performance regression in parquet reading
BufReader for ChunkObjectReader to revert performance regression in parquet readingBufReader for LocalFileReader to revert performance regression in parquet reading
|
🎉 |
|
Thanks @Dandandan ! Can you quickly explain what the reason for the slowdown was exactly? |
As far as I can explain: The earlier code used the By introducing the object storage abstraction, we were directly reading from a By wrapping the Maybe a potential improvement would be having a bit more control, such as setting the capacity of the buffer. |
Which issue does this PR close?
Closes #1363
Rationale for this change
Parquet reading was slow. This recovers the performance regression in the TPC-H benchmark.
There is still a slowdown in query 10 - and other queries, but this is unrelated to Parquet reading #1367 (and performance still improves from roughly 10 to 7s on that query).
What changes are included in this PR?
Use theBufReadtrait instead ofRead`.Are there any user-facing changes?