ShuffleReaderExec now supports multiple locations per partition#541
ShuffleReaderExec now supports multiple locations per partition#541alamb merged 3 commits intoapache:masterfrom
Conversation
| self.stream.size_hint() | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
This is moved to ballista-core utils
There was a problem hiding this comment.
this is a cool abstraction -- i have had need of something similar elsewhere -- perhaps it would be good to move to datafusion itself eventually
|
@edrevo fyi |
| Arc::new(self.schema.as_ref().clone()), | ||
| ); | ||
| Ok(Box::pin(result)) | ||
| } |
There was a problem hiding this comment.
This is the main change
| let x = self.partition[partition].clone(); | ||
| let result = future::join_all(x.into_iter().map(fetch_partition)) |
There was a problem hiding this comment.
nit; if you change fetch_partition to take a refernce, you can avoid the .clone:
| let x = self.partition[partition].clone(); | |
| let result = future::join_all(x.into_iter().map(fetch_partition)) | |
| let partition_locations = &self.partition[partition]; | |
| let result = future::join_all(partition_locations.iter().map(fetch_partition)) |
alamb
left a comment
There was a problem hiding this comment.
Looks good to me (though I am not a ballista expert)
| self.stream.size_hint() | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
this is a cool abstraction -- i have had need of something similar elsewhere -- perhaps it would be good to move to datafusion itself eventually
| .collect::<Result<Vec<_>>>()?; | ||
|
|
||
| let result = WrappedStream::new( | ||
| Box::pin(futures::stream::iter(result).flatten()), |
There was a problem hiding this comment.
this is a clever way of flattening the streams (though I wonder if it will serialize them all (aka not start reading from the second until the first is entirely consumed)?
There was a problem hiding this comment.
I wonder if it will serialize them all (aka not start reading from the second until the first is entirely consumed)?
Yes, that's exactly what it does.
There was a problem hiding this comment.
👍
I guess I figured I would point it out (that the different partitions wouldn't be producing in parallel)
There was a problem hiding this comment.
Once the basic shuffle mechanism is implemented, there will be a lot of optimization work to follow
Codecov Report
@@ Coverage Diff @@
## master #541 +/- ##
=======================================
Coverage 76.08% 76.09%
=======================================
Files 156 156
Lines 27035 27048 +13
=======================================
+ Hits 20570 20581 +11
- Misses 6465 6467 +2
Continue to review full report at Codecov.
|
Which issue does this PR close?
Closes #540 .
Rationale for this change
This is a step towards supporting true shuffle.
What changes are included in this PR?
WrappedStreamintoballista-corecrate and adds constructorShuffleReadExecto acceptVec<Vec<PartitionLocation>>instead ofVec<PartitionLocation>Are there any user-facing changes?