Re-implement parallel sync using the NBX algorithm. by friedmud · Pull Request #1826 · libMesh/libmesh

friedmud · 2018-08-19T07:16:07Z

Also adds nonblocking_barrier() and possibly_receive()

Switch to a true sparse parallel send/receive using the NBX algorithm from: https://htor.inf.ethz.ch/publications/img/hoefler-dsde-protocols.pdf

That algorithm is similar to "my" algorithm I'm working on - but works better (in my benchmarking) for this case of one-sided, single hop, single data packet, point-to-point transfers.

roystgnr · 2018-08-20T11:05:10Z

The timeout is scary, the NaNs above are scarier. I'll try to get a look at this shortly.

roystgnr · 2018-08-20T15:04:32Z

+  if (this->size() > 1)
+    {
+      LOG_SCOPE("nonblocking_barrier()", "Parallel");
+      libmesh_call_mpi(MPI_Ibarrier (this->get(), req.get()));


MPI_Ibarrier is an MPI3-only method, isn't it? We almost certainly want to keep MPI-2 capability as a fallback, even if builds with MPI-3 work faster. Didn't the code you showed me before use non-blocking reductions rather than non-blocking barriers?

It did - but this is more efficient. I do believe that it's MPI-3.

Why do we want to keep MPI-2? MPI-2 is CRAZY old at this point. Just like we moved on to C++11... we should move on.

MPI-3 is 2012 BTW... and it's WAY easy to upgrade MPI

friedmud · 2018-08-20T17:02:27Z

I have no idea why this is failing BTW: I've been using it for days without any issues...

friedmud · 2018-08-20T17:05:56Z

@roystgnr is it possible that some of the send() specializations aren't honoring the synchronous send flag? That would definitely lead to badness!

roystgnr · 2018-08-23T13:59:06Z

Definitely thought that bug was possible, since IIRC we don't have any test coverage for send_mode(), but grepping through parallel_implementation.h doesn't find anything wrong. Every send implementation either tests send_mode or hands off to a different send implementation; nothing just blindly calls MPI_Send or MPI_Isend.

friedmud · 2018-08-26T15:40:51Z

Hmmm - thanks for checking. I don't have time to look into these failures right now - but I might be able to get to them this week.


moosebuild · 2018-11-20T21:55:31Z

Job Test on c024b30 : invalidated by @friedmud


friedmud · 2018-11-21T07:43:36Z

@roystgnr check this out. I think this is good to go now.

I didn't implement the vector<vector<T>> case yet. Do you see a way to unify those two at all - or will it just need to pretty much be a duplication? I tried to see if I could get a unified one to go... but I couldn't come up with the correct template foo to make it happen.

Note that I changed the way unique tag IDs are chosen now. The reason is fairly subtle... but basically you need two adjacent calls to push/pull_parallel_vector_data() to use definitely different tag IDs.

I went with @fdkong and looked at what PETSc does for this... and they do something close to what I've coded here (except that they count down and they don't bother to keep a list of currently used ones... mostly just hoping that if they wrap around there are no collisions).

In my testing yesterday this algorithm can make a BIG impact at 200M Dofs on 4k+ procs.

roystgnr · 2018-11-21T15:06:47Z

Thanks!

@roystgnr check this out.

Will do now, literally. Multiple git checkouts, even. I'd like to run on a few different systems and make sure the "works on one system, breaks on another" sort of issues are really gone.

Assuming I can't shake loose any problems, we'll want to merge this ASAP. But at some point I may be tempted to add back "--sync_algorithm old" and "--sync_algorithm older" command line options for debugging and benchmarking purposes.

roystgnr · 2018-11-21T16:13:48Z

I am seeing hangs. Going to see if I can get a sweep to trigger them via Civet too.

roystgnr · 2018-11-21T16:30:53Z

Huh. At least my first hang seems like it's an unrelated bug that my tests just happened to trigger now. I'll put in a fix in a different PR and keep going.

roystgnr · 2018-11-21T22:55:48Z

      pushed_keys_vals_to_me[pid] = data;
    };

-  // Trade pushed dof constraint rows


Why'd this vanish?

Well - I did have some back and forth on revisions that edited this part of the code - it's just a mess-up on my part to have left it like this. I'll fix it.

roystgnr · 2018-11-21T23:09:23Z

I'm seeing one more failure, but I can't reproduce it everywhere, and I can get the same failure from a different branch. If you want this in master urgently I'm okay with merging now, but I'd prefer to have a little more time to look at it over the holiday.

friedmud · 2018-11-21T23:39:49Z

Interesting - I'm not in any hurry... I'll be banging on this patch myself this weekend.

We were able to run some pretty big jobs with this patch today already... but it needs to be perfect before we put it in so let me know if you see something....

friedmud · 2018-11-21T23:40:16Z

Also: thanks for looking at it!

roystgnr · 2018-11-30T19:53:07Z

+  // tag that is in use because we have
+  // ~2 billion of them
+  while(used_tag_values.count(new_tag))
+    new_tag++;


Ignoring tagvalue prevents use cases where the user wants to request tags in different orders on different processors; e.g. to use a task management system where ranks grab new bits of work as they complete prior tasks. We don't actually do that or intend to at any time in the near future, but it was the original reasoning.

So what's the reasoning behind removing it? We already had a "grab the next larger tag if the one we looked at first is already in use" mechanism. Why didn't it work for your case?

friedmud · 2018-12-01T00:22:17Z

Fair - I hadn't thought about that use-case. Here's what doesn't work for my case: in a completely non-blocking world... it can be the case that two adjacent calls of push or pull end up getting messages crossed. The reason why is super subtle... but let me see if I can explain:

The issue is that synchronous sends only guarantee that they return as "finished" when the receiver starts receiving the message... NOT when the message has been completely received.

So here's how it can play out with two processors (0 and 1) with two push()es in a row, where proc0 needs to send to proc1 both times (but the data is different in each push()):

(I was going to try to draw this - but it's actually even harder to draw than it is to just write out the list of events.)

1. proc0 calls push() and grabs tag 1234
2. proc1 calls push() and grabs tag 1234
3. proc1 hits the ibarrier because it's not sending anything
4. proc0 initiates an issend to proc1
5. proc1 sees that send using 1234 and starts an irecv
6. proc0 is notified that proc1 is receiving - finishes the issend
7. proc0 hits the ibarrier - and completes it
8. proc1 is still working on the receive (meaning it's still looping, looking for additional messages and checking to see if the irecv is finished yet)
9. proc0 exits push()
10. proc0 releases tag 1234 back into the pool because the function is over
11. proc0 starts the next push()
12. proc0 grabs 1234 again (because that's what's hardcoded for it to try to grab - and it's available again!)
13. proc1 is still working on the receive...
14. proc0 starts a new issend() using tag 1234 again
15. proc1 (still in the first call to push()!) sees that send on a tag it's still looking for! It starts another irecv() to pull it in! Oh damn! Now proc1 has received two messages within the same call to push() when it was supposed to receive one message in each of the two push() calls
16. EVERYTHING DIES because proc1 is going to try to do something invalid with the second message...

Non-blocking land is a tricky beast! I've had my head in it now for nearly 3 years... and it's still tough to spot things like this!

So: this would all be fixed if the two push() calls in a row used different tags! If proc0 got 1235 for a tag in the second invocation of push()... then proc1 would totally ignore that message while it finished up receiving the first one... then it would exit the first push() and start the second one and do the right thing.

It should be noted again that the scheme that I've added here for managing tags is very closely related to what PETSc does... and I believe they probably do it for a very similar reason...

Also: there are other mitigation strategies (you could add a true barrier() at the end of push() for instance)... but they are all slower. We want to allow proc0 to race ahead and start sending more messages... it might be sending to other processors who have already finished that first push() and they can start receiving... all while proc1 is still wrestling with that first message.

ALSO: note that this is not just an academic exercise. It's seriously easy to get this (I actually was!) if your messages are of a decent size and there are a lot of senders...
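
The sequence of events above can be modeled without MPI. What follows is a minimal sketch, not libMesh code: `TagPool` and `messages_matched_in_first_push` are hypothetical names, and the "wire" is just a list of the tags proc0 has posted to proc1. It only illustrates why handing the same tag back out lets proc1's first push() match a message meant for the second one, and why a tag counter that keeps advancing avoids that.

```cpp
#include <cassert>
#include <deque>
#include <set>

// Hypothetical tag pool (an assumption for illustration, not libMesh's
// Communicator). It models two policies:
//  - reuse:     start from the requested tag, skip tags currently in use
//  - monotonic: ignore the request, hand out an ever-increasing tag
struct TagPool
{
  explicit TagPool(bool monotonic_in) : monotonic(monotonic_in) {}

  int acquire(int requested)
  {
    int tag = monotonic ? next_tag : requested;
    while (in_use.count(tag))
      ++tag;
    in_use.insert(tag);
    if (monotonic)
      next_tag = tag + 1; // never hand this value out again (wrap-around ignored)
    return tag;
  }

  void release(int tag) { in_use.erase(tag); }

  bool monotonic;
  int next_tag = 1234;
  std::set<int> in_use;
};

// Replays the two-processor scenario and returns how many messages proc1
// would match while it is still inside its FIRST push().
int messages_matched_in_first_push(bool monotonic)
{
  TagPool proc0(monotonic);
  TagPool proc1(monotonic);
  std::deque<int> wire; // tags of messages posted from proc0 to proc1

  // First push(): both ranks pick a tag, proc0 posts its send.
  const int proc0_first = proc0.acquire(1234);
  const int proc1_first = proc1.acquire(1234);
  wire.push_back(proc0_first);

  // proc0 leaves the first push() early and releases its tag...
  proc0.release(proc0_first);

  // ...then starts the second push() and posts another send,
  // while proc1 has not left the first push() yet.
  const int proc0_second = proc0.acquire(1234);
  wire.push_back(proc0_second);

  // proc1 is still looping in the first push(), matching on its first tag.
  int matched = 0;
  for (int tag : wire)
    if (tag == proc1_first)
      ++matched;
  return matched;
}
```

Under these assumptions, `messages_matched_in_first_push(false)` (reuse) returns 2, which is the crossed-message case, while `messages_matched_in_first_push(true)` (monotonic) returns 1.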

roystgnr · 2018-12-03T16:56:28Z

Well, damn, that was a good answer.

And now I see the obvious - the old code was incrementing for ids already in use, then going back to the requested id as soon as it thought it could; the new code is incrementing for ids already in use but that's basically pointless because it stays incremented until wrapping around MAX_INT.

I still don't want to break anyone who decided to try an isend(tag=1); isend(tag=2); recv(tag=2); recv(tag=1) if we don't have to... but now I'm much less confident that we don't have to.

Could we make user tag requests an optional feature?

- Make the get_unique_tag default value be MessageTag::invalid_tag
- If we see invalid_tag we use _next_tag++ (or increment that further if that's already taken)
- If we see a user-requested tag we use it (or increment it further if that's already taken)
- Initialize _next_tag to 1B, so user requests are less likely to conflict with _next_tag
- Make every libMesh internal get_unique_tag() call use the default, so even if we use those tags for asynchronous stuff we have fewer internal "user" requests likely to conflict with user requests.
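
To make the proposal concrete, here is a rough sketch of how such a two-mode allocator might look. It is an illustration only: `TagAllocator` is a hypothetical stand-in rather than libMesh's Communicator, the sentinel value chosen for `invalid_tag` is an arbitrary placeholder, and wrap-around at the MPI tag upper bound is deliberately left out.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the tag bookkeeping discussed above.
class TagAllocator
{
public:
  // Placeholder sentinel meaning "no specific tag requested" (assumed value).
  static constexpr int invalid_tag = -1;

  int get_unique_tag(int requested = invalid_tag)
  {
    int tag;
    if (requested == invalid_tag)
      {
        // Automatic mode: take the next counter value, skipping any
        // tag that is currently in use, and keep the counter advanced.
        tag = _next_tag;
        while (_used.count(tag))
          ++tag;
        _next_tag = tag + 1;
      }
    else
      {
        // User-requested mode: honor the request if it is free,
        // otherwise step up to the next free value.
        tag = requested;
        while (_used.count(tag))
          ++tag;
      }
    _used.insert(tag);
    return tag;
  }

  void release(int tag) { _used.erase(tag); }

private:
  // Start automatic tags at 1B so small user requests rarely collide.
  int _next_tag = 1000000000;
  std::set<int> _used;
};
```

In this sketch, two automatic requests in a row never return the same value even if the first tag was released in between, while explicit requests such as `get_unique_tag(5)` still get the value they asked for when it is free.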

roystgnr · 2018-12-04T21:07:04Z

+        int flag;
+        libmesh_call_mpi(MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &maxTag, &flag));
+
+        _max_tag = *maxTag;


This appears to be incorrect. On MPICH 3.2 I get -4026531841 this way, which doesn't fit in int and isn't even positive. The MPI_Comm_get_attr man page seems to think we need to be passing in a void* and then casting the pointer to int* before dereferencing.

I'll do that; I've got a fork of this branch in which I'm trying out my "see if I can retain backwards tagvalue compatibility without breaking Derek's stuff" plan.

roystgnr · 2018-12-04T22:50:25Z

-  //   libmesh_assert_equal_to (tagvalue, maxval);
-  // #endif
+  used_tag_values[new_tag] = 1;



I dug into SVN history to see why this was commented out, and apparently Ben had a use case in which it was expected to fail, because tags were being grabbed on a subset of processors. So I feel a little better about going to the effort of respecting tagvalue where it's used while avoiding using it where possible.

Actually, that two-part system ought to make sanity checking easier. Where the user requests an automatic tag value, we can be sure they want to keep _next_tag in sync, and we can restore this check.

roystgnr · 2019-11-12T20:26:17Z

This algorithm made it in via #1965

roystgnr reviewed Aug 20, 2018


pbauman mentioned this pull request Aug 20, 2018

MPI-3 #1827

Open

roystgnr mentioned this pull request Aug 21, 2018

Add version-check of MPI in configure #1822

Merged

roystgnr mentioned this pull request Oct 1, 2018

Macro tweaks #1868

Merged

friedmud force-pushed the nbx_push_vectors branch 4 times, most recently from e721c50 to 0234ad5 (November 21, 2018 07:01)

Re-implement parallel sync using the NBX algorithm.

8e7df37

Also adds nonblocking_barrier() and possibly_receive()

friedmud force-pushed the nbx_push_vectors branch 3 times, most recently from cccbf84 to 6c9e18d (November 21, 2018 07:21)

friedmud added 2 commits November 21, 2018 00:35

Change how communicator chooses unique message tags

b53905c

Finish implementation of NBX and use it for both push and pull

6d1a83a

friedmud force-pushed the nbx_push_vectors branch from 6c9e18d to 6d1a83a (November 21, 2018 07:35)

roystgnr reviewed Nov 21, 2018


roystgnr reviewed Nov 30, 2018


roystgnr reviewed Dec 4, 2018


roystgnr mentioned this pull request Dec 4, 2018

Re-implement parallel sync using the NBX algorithm #1965

Merged

roystgnr closed this Nov 12, 2019
