Implement custom fold for ZipLongest#764
Conversation
|
I'm not very familiar with benchmarks but I don't think we can expect it to be two times faster. PS: As you can see here, you needed to run EDIT (after response): Ah 1024 and 768 are just mirrored, ok. I did not expect such difference (22% vs 5%) for such change. |
Sorry - by "proposed benchmark", I mean the |
jswrenn
left a comment
There was a problem hiding this comment.
Thank you so much for this! I've left a comment on how to further optimize fold. Could you also add a test for the specialization in tests/specializations.rs? Something like this should do the trick:
quickcheck! {
fn intersperse(v: Vec<u8>) -> () {
test_specializations(&v.into_iter().intersperse(0));
}
+
+ fn zip_longest(a: Vec<u8>, b: Vec<u8>) -> () {
+ test_specializations(&a.into_iter().zip_longest(b))
+ }
}| #[inline] | ||
| fn fold<B, F>(self, mut acc: B, mut f: F) -> B | ||
| where | ||
| Self: Sized, | ||
| F: FnMut(B, Self::Item) -> B, | ||
| { | ||
| let ZipLongest { mut a, mut b } = self; | ||
|
|
||
| loop { | ||
| match (a.next(), b.next()) { | ||
| (Some(x), Some(y)) => acc = f(acc, EitherOrBoth::Both(x, y)), | ||
| (Some(x), None) => { | ||
| acc = f(acc, EitherOrBoth::Left(x)); | ||
| // b is exhausted, so we can drain a. | ||
| return a.fold(acc, |acc, x| f(acc, EitherOrBoth::Left(x))); | ||
| } | ||
| (None, Some(y)) => { | ||
| acc = f(acc, EitherOrBoth::Right(y)); | ||
| // a is exhausted, so we can drain b. | ||
| return b.fold(acc, |acc, y| f(acc, EitherOrBoth::Right(y))); | ||
| } | ||
| (None, None) => return acc, // Both iterators are exhausted. | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
When specializing fold, try to avoid for loops and calls to next() on underlying iterators. If those underlying iterators also specialize fold, then for looping and calling next() bypasses their optimizations.
It's possible to implement this method in terms of fold:
| #[inline] | |
| fn fold<B, F>(self, mut acc: B, mut f: F) -> B | |
| where | |
| Self: Sized, | |
| F: FnMut(B, Self::Item) -> B, | |
| { | |
| let ZipLongest { mut a, mut b } = self; | |
| loop { | |
| match (a.next(), b.next()) { | |
| (Some(x), Some(y)) => acc = f(acc, EitherOrBoth::Both(x, y)), | |
| (Some(x), None) => { | |
| acc = f(acc, EitherOrBoth::Left(x)); | |
| // b is exhausted, so we can drain a. | |
| return a.fold(acc, |acc, x| f(acc, EitherOrBoth::Left(x))); | |
| } | |
| (None, Some(y)) => { | |
| acc = f(acc, EitherOrBoth::Right(y)); | |
| // a is exhausted, so we can drain b. | |
| return b.fold(acc, |acc, y| f(acc, EitherOrBoth::Right(y))); | |
| } | |
| (None, None) => return acc, // Both iterators are exhausted. | |
| } | |
| } | |
| } | |
| #[inline] | |
| fn fold<B, F>(self, mut init: B, mut f: F) -> B | |
| where | |
| Self: Sized, | |
| F: FnMut(B, Self::Item) -> B, | |
| { | |
| let ZipLongest { a, mut b } = self; | |
| init = a.fold(init, |init, a| match b.next() { | |
| Some(b) => f(init, EitherOrBoth::Both(a, b)), | |
| None => f(init, EitherOrBoth::Left(a)), | |
| }); | |
| b.fold(init, |init, b| f(init, EitherOrBoth::Right(b))) | |
| } |
There was a problem hiding this comment.
Thanks @jswrenn for your feedback. I tried to run the benchmark against your recommendation but the result is regressing
time: [594.12 ns 594.85 ns 595.65 ns]
change: [+20.625% +21.035% +21.434%] (p = 0.00 < 0.05)
Performance has regressed.
There was a problem hiding this comment.
I was gonna suggest to try to rely a bit on zip from the std with init = a.by_ref().zip(b.ref()).fold(init, &mut f); ... but Zip do not have a fold specialization yet. It would allow to discard the loop (and the Some, Some case).
What about f(init, match b.next() {...}) (in the first fold)? (I guess it's the same?)
There was a problem hiding this comment.
Yes unfortunately. I tried :
{
let ZipLongest { a, mut b } = self;
init = a.fold(init, |init, a| f(init, match b.next() {
Some(b) => EitherOrBoth::Both(a, b),
None => EitherOrBoth::Left(a),
}));
b.fold(init, |init, b| f(init, EitherOrBoth::Right(b)))
}
and the result is same: time: [593.41 ns 594.82 ns 596.39 ns]
There was a problem hiding this comment.
In #765 I considered folding a range k+1..=n but did not because it was slower, which surprised me. It seems weird to even suggest but maybe it's because of Range*::fold?!
Since folds in the benchmark rely on the same thing here, maybe you can try with different iterators.
Like let v1 = black_box((0..***).collect_vec()); v2... (before bench_function) and then v1.iter().zip_longest(v2.iter()).
There was a problem hiding this comment.
@jswrenn Thanks a lot! Yes after looking at the 2 files you shared, I can understand why for loop should always be avoided (and use the underlying fold api which has been optimized). That's really inspiring to me. I've incorporated your suggestions just now. Thanks a lot!
@Philippe-Cholet Yes the issue is on Range*::fold. I adopted your suggestions and rerun the benchmark. Using the default fold (which calls next) the time is time: [2.0522 µs 2.0558 µs 2.0599 µs], and using the new fold logic, I can see a huge improvement time: [414.34 ns 415.44 ns 416.75 ns]
Thanks both!
There was a problem hiding this comment.
Diving... Apparently, Range (a..b) does not specialize fold (so maybe that's why there was bad perf), but RangeInclusive (a..=b) does relying on try_fold (implementation seems tricky to me). std::vec::IntoIter does not either, but std::slice::Iter which I redirected you to (without much insight) does.
So I guess we should continue with some_vec.iter().fold(...) for benchmarks.
There was a problem hiding this comment.
Indeed. In future PR that optimizes fold, we should stick with some_vec.iter().fold(...) for benchmarking
There was a problem hiding this comment.
@jswrenn Do you have any further feedback for my PR ? If no I think we can merge this.
There was a problem hiding this comment.
Usual link for how things should prefer forwarding internal iteration to internal iteration where possible: https://medium.com/@veedrac/rust-is-slow-and-i-am-the-cure-32facc0fdcb.
chain is the usual example where internal iteration tends to be measurably faster than external. For things like Range<usize> or slice::Iter, there's typically no difference.
(So this is my +1 to forwarding fold to fold where possible, and when we finally make it possible to implement try_fold, having that forward to try_fold.)
3e5395d to
575819a
Compare
575819a to
c4358ab
Compare
|
Sigh. While attempting to fix a spurious merge conflict error, I reset the head back slightly too far. Github autoclosed the PR, and now I don't have edit access. @Owen-CH-Leung Can you just run |
As per discussion in #755 , this PR implements custom fold logic for ZipLongest<T, U> and create benchmark test to measure the performance gain compared with the default fold.
Running the proposed benchmark using the default
foldfrom iterator, the time wastime: [592.01 ns 593.24 ns 594.70 ns]Running the proposed benchmark using the proposed
foldlogic, the time wastime: [486.75 ns 487.88 ns 489.29 ns]But the performance gain is not so significant if the setup changes to
(0..1024).zip_longest(0..768)using the default
foldlogic:time: [509.25 ns 510.21 ns 511.35 ns]using the proposed
foldlogic:time: [485.58 ns 486.24 ns 486.97 ns]