Re-implement parallel sync using the NBX algorithm. #1826
|
The timeout is scary, the NaNs above are scarier. I'll try to get a look at this shortly. |
    if (this->size() > 1)
    {
      LOG_SCOPE("nonblocking_barrier()", "Parallel");
      libmesh_call_mpi(MPI_Ibarrier (this->get(), req.get()));
|
MPI_Ibarrier is an MPI-3-only function, isn't it? We almost certainly want to keep MPI-2 capability as a fallback, even if builds with MPI-3 work faster. Didn't the code you showed me before use non-blocking reductions rather than non-blocking barriers?
|
It did - but this is more efficient. I do believe that it's MPI-3.
Why do we want to keep MPI-2? MPI-2 is CRAZY old at this point. Just like we moved on to C++11... we should move on.
|
MPI-3 is 2012 BTW... and it's WAY easy to upgrade MPI
|
I have no idea why this is failing BTW: I've been using it for days without any issues... |
|
@roystgnr is it possible that some of the |
|
Definitely thought that bug was possible, since IIRC we don't have any test coverage for send_mode(), but grepping through parallel_implementation.h doesn't find anything wrong. Every send implementation either tests send_mode or hands off to a different send implementation; nothing just blindly calls MPI_Send or MPI_Isend. |
|
Hmmm - thanks for checking. I don't have time to look into these failures right now - but I might be able to get to them this week.
|
e721c50 to 0234ad5
Also adds nonblocking_barrier() and possibly_receive()
cccbf84 to 6c9e18d
6c9e18d to 6d1a83a
|
@roystgnr check this out. I think this is good to go now. I didn't implement the
Note that I changed the way unique tag IDs are chosen. The reason is fairly subtle... but basically you need two adjacent calls to push/pull_parallel_vector_data() to use definitely different tag IDs. I went with @fdkong and looked at what PETSc does for this... and they do something close to what I've coded here (except that they count down and they don't bother to keep a list of currently used ones... mostly just hoping that if they wrap around there are no collisions).
In my testing yesterday this algorithm can make a BIG impact at 200M DoFs on 4k+ procs. |
|
Thanks!
Will do now, literally. Multiple git checkouts, even. I'd like to run on a few different systems and make sure the "works on one system, breaks on another" sort of issues are really gone. Assuming I can't shake loose any problems, we'll want to merge this ASAP. But at some point I may be tempted to add back "--sync_algorithm old" and "--sync_algorithm older" command line options for debugging and benchmarking purposes. |
|
I am seeing hangs. Going to see if I can get a sweep to trigger them via Civet too. |
|
Huh. At least my first hang seems like it's an unrelated bug that my tests just happened to trigger now. I'll put in a fix in a different PR and keep going. |
    pushed_keys_vals_to_me[pid] = data;
    };

    // Trade pushed dof constraint rows
|
Well - I did have some back and forth on revisions that edited this part of the code - it was just a mess-up on my part to leave it like this, though. I'll fix it.
|
I'm seeing one more failure, but I can't reproduce it everywhere, and I can get the same failure from a different branch. If you want this in master urgently I'm okay with merging now, but I'd prefer to have a little more time to look at it over the holiday. |
|
Interesting - I'm not in any hurry... I'll be banging on this patch myself this weekend. We were able to run some pretty big jobs with this patch today already... but it needs to be perfect before we put it in so let me know if you see something.... |
|
Also: thanks for looking at it! |
    // tag that is in use because we have
    // ~2 billion of them
    while(used_tag_values.count(new_tag))
      new_tag++;
|
Ignoring tagvalue prevents use cases where the user wants to request tags in different orders on different processors; e.g. to use a task management system where ranks grab new bits of work as they complete prior tasks. We don't actually do that or intend to at any time in the near future, but it was the original reasoning.
So what's the reasoning behind removing it? We already had a "grab the next larger tag if the one we looked at first is already in use" mechanism. Why didn't it work for your case?
|
Fair - I hadn't thought about that use-case. Here's what doesn't work for my case: in a completely non-blocking world... it can be the case that two adjacent calls of
The issue is that synchronous sends only guarantee that they return as "finished" when the receiver starts receiving the message... NOT when the message has been completely received. So here's how it can play out with two processors (0 and 1) with two
(I was going to try to draw this - but it's actually even harder to draw than it is to just write out the list of events.)
Non-blocking land is a tricky beast! I've had my head in it now for nearly 3 years... and it's still tough to spot things like this! So: this would all be fixed if the two
It should be noted again that the scheme that I've added here for managing tags is very closely related to what PETSc does... and I believe they probably do it for a very similar reason... Also: there are other mitigation strategies (you could add true
ALSO: note that this is not just an academic exercise. It's seriously easy to get this (I actually was!) if your messages are of a decent size and there are a lot of senders... |
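For readers following along, here is a minimal sketch of one way the tag reuse described above could go wrong. It does not use MPI: `Message`, `Mailbox`, and `misdelivered` are hypothetical stand-ins, and the mailbox matches on tag only, roughly like probing with any source. The assumption is that rank 0 has already left the first sync call and started the second while rank 1 is still in its receive loop for the first.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for a message in flight; not a libMesh or MPI type.
struct Message
{
  int source;
  int tag;
  int phase; // which sync call produced it (1 or 2); real tag matching cannot see this
};

// Hypothetical mailbox that matches on tag only, like a probe with any source.
class Mailbox
{
public:
  void post(const Message & m) { _pending.push_back(m); }

  // Remove the first pending message with the given tag and copy it to 'out'.
  // Returns false if no pending message carries that tag.
  bool take(int tag, Message & out)
  {
    for (auto it = _pending.begin(); it != _pending.end(); ++it)
      if (it->tag == tag)
        {
          out = *it;
          _pending.erase(it);
          return true;
        }
    return false;
  }

private:
  std::deque<Message> _pending;
};

// Rank 1 is still draining sync call 1 while rank 0 has already moved on to
// sync call 2. Returns how many call-2 messages rank 1 consumes while it
// believes it is still servicing call 1.
int misdelivered(int tag_call_1, int tag_call_2)
{
  Mailbox rank1_inbox;
  rank1_inbox.post({0, tag_call_1, 1}); // rank 0's message from call 1
  rank1_inbox.post({0, tag_call_2, 2}); // rank 0 raced ahead into call 2

  int wrong = 0;
  Message m;
  while (rank1_inbox.take(tag_call_1, m)) // rank 1's receive loop for call 1
    if (m.phase != 1)
      ++wrong;
  return wrong;
}
```

With the same tag for both calls, `misdelivered(100, 100)` counts one message picked up by the wrong call; with distinct tags, `misdelivered(100, 101)` counts none, which is the property the new tag scheme is after.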
|
Well, damn, that was a good answer. And now I see the obvious - the old code was incrementing for ids already in use, then going back to the requested id as soon as it thought it could; the new code is incrementing for ids already in use but that's basically pointless because it stays incremented until wrapping around MAX_INT. I still don't want to break anyone who decided to try a
Could we make user tag requests an optional feature?
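A rough sketch of the two-part idea floated here, assuming automatic tags come from a counter that only moves forward (wrapping at the maximum) while explicitly requested tags are honored when free. `TagPool` and its method names are hypothetical, not the libMesh API, and the sketch assumes the pool is never fully exhausted.

```cpp
#include <cassert>
#include <set>

// Hypothetical tag allocator; names are illustrative only.
// Assumes 0 <= tag <= max_tag and that some tag is always free.
class TagPool
{
public:
  explicit TagPool(int max_tag) : _max_tag(max_tag), _next_tag(0) {}

  // Automatic tag: the counter keeps advancing, so two back-to-back requests
  // get different tags even if the first one was released in between.
  int get_automatic_tag()
  {
    const int tag = first_free_from(_next_tag);
    _next_tag = next_after(tag);
    _used.insert(tag);
    return tag;
  }

  // Requested tag: use the caller's value if it is free, otherwise the next
  // larger free value. This does not move the automatic counter.
  int get_requested_tag(int requested)
  {
    const int tag = first_free_from(requested);
    _used.insert(tag);
    return tag;
  }

  // Mark a tag as no longer in use.
  void release(int tag) { _used.erase(tag); }

private:
  // Step to the following tag value, wrapping to 0 past _max_tag.
  int next_after(int tag) const { return (tag == _max_tag) ? 0 : tag + 1; }

  // Walk forward from 'start' until a tag that is not in use is found.
  int first_free_from(int start) const
  {
    int tag = start;
    while (_used.count(tag))
      tag = next_after(tag);
    return tag;
  }

  int _max_tag;
  int _next_tag;
  std::set<int> _used;
};
```

The design choice being illustrated: releasing tag 0 does not make the next automatic request hand out 0 again, so adjacent automatic requests stay distinct, while a caller who asks for a specific value still gets it whenever it is free.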
|
    int flag;
    libmesh_call_mpi(MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &maxTag, &flag));

    _max_tag = *maxTag;
|
This appears to be incorrect. On MPICH 3.2 I get -4026531841 this way, which doesn't fit in int and isn't even positive. The MPI_Comm_get_attr man page seems to think we need to be passing in a void* and then casting the pointer to int* before dereferencing.
I'll do that; I've got a fork of this branch in which I'm trying out my "see if I can retain backwards tagvalue compatibility without breaking Derek's stuff" plan.
|
    // libmesh_assert_equal_to (tagvalue, maxval);
    // #endif
    used_tag_values[new_tag] = 1;
|
I dug into SVN history to see why this was commented out, and apparently Ben had a use case in which it was expected to fail, because tags were being grabbed on a subset of processors. So I feel a little better about going to the effort of respecting tagvalue where it's used while avoiding using it where possible.
Actually, that two-part system ought to make sanity checking easier. Where the user requests an automatic tag value, we can be sure they want to keep _next_tag in sync, and we can restore this check.
|
This algorithm made it in via #1965 |
Switch to a true sparse parallel send/receive using the NBX algorithm from: https://htor.inf.ethz.ch/publications/img/hoefler-dsde-protocols.pdf
That algorithm is similar to "my" algorithm I'm working on - but works better (in my benchmarking) for this case of one-sided, single hop, single data packet, point-to-point transfers.