Skip to content

Re-implement parallel sync using the NBX algorithm.#1826

Closed
friedmud wants to merge 3 commits into
libMesh:masterfrom
friedmud:nbx_push_vectors
Closed

Re-implement parallel sync using the NBX algorithm.#1826
friedmud wants to merge 3 commits into
libMesh:masterfrom
friedmud:nbx_push_vectors

Conversation

@friedmud
Copy link
Copy Markdown
Member

Also adds nonblocking_barrier() and possibly_receive()

Switch to a true sparse parallel send/receive using the NBX algorithm from: https://htor.inf.ethz.ch/publications/img/hoefler-dsde-protocols.pdf

That algorithm is similar to "my" algorithm I'm working on - but works better (in my benchmarking) for this case of one-sided, single hop, single data packet, point-to-point transfers.

@roystgnr
Copy link
Copy Markdown
Member

The timeout is scary, the NaNs above are scarier. I'll try to get a look at this shortly.

if (this->size() > 1)
{
LOG_SCOPE("nonblocking_barrier()", "Parallel");
libmesh_call_mpi(MPI_Ibarrier (this->get(), req.get()));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MPI_Ibarrier is an MPI3-only method, isn't it? We almost certainly want to keep MPI-2 capability as a fallback, even if builds with MPI-3 work faster. Didn't the code you showed me before use non-blocking reductions rather than non-blocking barriers?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did - but this is more efficient. I do believe that it's MPI-3.

Why do we want to keep MPI-2? MPI-2 is CRAZY old at this point. Just like we moved on to C++11... we should move on.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MPI-3 is 2012 BTW... and it's WAY easy to upgrade MPI

@friedmud
Copy link
Copy Markdown
Member Author

I have no idea why this is failing BTW: I've been using it for days without any issues...

@friedmud
Copy link
Copy Markdown
Member Author

@roystgnr is it possible that some of the send() specializations aren't honoring the synchronous send flag? That would definitely lead to badness!

@roystgnr
Copy link
Copy Markdown
Member

Definitely thought that bug was possible, since IIRC we don't have any test coverage for send_mode(), but grepping through parallel_implementation.h doesn't find anything wrong. Every send implementation either tests send_mode or hands off to a different send implementation; nothing just blindly calls MPI_Send or MPI_Isend.

@friedmud
Copy link
Copy Markdown
Member Author

friedmud commented Aug 26, 2018 via email

@roystgnr roystgnr mentioned this pull request Oct 1, 2018
@moosebuild
Copy link
Copy Markdown

Job Test on c024b30 : invalidated by @friedmud

@friedmud friedmud force-pushed the nbx_push_vectors branch 4 times, most recently from e721c50 to 0234ad5 Compare November 21, 2018 07:01
Also adds nonblocking_barrier() and possibly_receive()
@friedmud friedmud force-pushed the nbx_push_vectors branch 3 times, most recently from cccbf84 to 6c9e18d Compare November 21, 2018 07:21
@friedmud
Copy link
Copy Markdown
Member Author

@roystgnr check this out. I think this is good to go now.

I didn't implement the vector<vector<T>> case yet. Do you see a way to unify those two at all - or will it just need to pretty much be a duplication? I tried to see if I could get a unified one to go... but I couldn't come up with the correct template foo to make it happen.

Note that I changed the way unique tag IDs are chosen now. The reason is fairly subtle... but basically you need two adjacent calls to push/pull_parallel_vector_data() to use definitely different tag IDs.

I went with @fdkong and looked at what PETSc does for this... and they do something close to what I've coded here (except that they count down and they don't bother to keep a list of currently used ones... mostly just hoping that if they wrap around there are no collisions).

In my testing yesterday this algorithm can make a BIG impact at 200M Dofs on 4k+ procs.

@roystgnr
Copy link
Copy Markdown
Member

Thanks!

@roystgnr check this out.

Will do now, literally. Multiple git checkouts, even. I'd like to run on a few different systems and make sure the "works on one system, breaks on another" sort of issues are really gone.

Assuming I can't shake loose any problems, we'll want to merge this ASAP. But at some point I may be tempted to add back "--sync_algorithm old" and "--sync_algorithm older" command line options for debugging and benchmarking purposes.

@roystgnr
Copy link
Copy Markdown
Member

I am seeing hangs. Going to see if I can get a sweep to trigger them via Civet too.

@roystgnr
Copy link
Copy Markdown
Member

Huh. At least my first hang seems like it's an unrelated bug that my tests just happened to trigger now. I'll put in a fix in a different PR and keep going.

pushed_keys_vals_to_me[pid] = data;
};

// Trade pushed dof constraint rows
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why'd this vanish?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well - I did have some back and forth on revisions that edited this part of code - just a messup on my part to leave this like this though. I'll fix it.

@roystgnr
Copy link
Copy Markdown
Member

I'm seeing one more failure, but I can't reproduce it everywhere, and I can get the same failure from a different branch. If you want this in master urgently I'm okay with merging now, but I'd prefer to have a little more time to look at it over the holiday.

@friedmud
Copy link
Copy Markdown
Member Author

Interesting - I'm not in any hurry... I'll be banging on this patch myself this weekend.

We were able to run some pretty big jobs with this patch today already... but it needs to be perfect before we put it in so let me know if you see something....

@friedmud
Copy link
Copy Markdown
Member Author

Also: thanks for looking at it!

// tag that is in use because we have
// ~2 billion of them
while(used_tag_values.count(new_tag))
new_tag++;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignoring tagvalue prevents use cases where the user wants to request tags in different orders on different processors; e.g. to use a task management system where ranks grab new bits of work as they complete prior tasks. We don't actually do that or intend to at any time in the near future, but it was the original reasoning.

So what's the reasoning behind removing it? We already had a "grab the next larger tag if the one we looked at first is already in use" mechanism. Why didn't it work for your case?

@friedmud
Copy link
Copy Markdown
Member Author

friedmud commented Dec 1, 2018

Fair - I hadn't thought about that use-case. Here's what doesn't work for my case: in completely non-blocking world... it can be the case that two adjacent calls of push or pull end up getting messages crossed. The reason why is super subtle... but let me see if I can explain:

The issue is that synchronous sends only guarantee that they return as "finished" when the receiver starts receiving the message... NOT when the message has been completely received.

So here's how it can play out with two processors (0 and 1) with two push()es in a row where proc0 needs to send to proc1 both times (but the data is different in each push())

(I was going to try to draw this - but it's actually even harder to draw than it is to just write out the list of events.

  1. proc0 calls push() grabs tag 1234
  2. proc1 calls push() grabs tag 1234
  3. proc1 hits the ibarrier because it's not sending anything
  4. proc0 initiates an issend to proc1
  5. proc1 sees that send using 1234 and starts an irecv
  6. proc0 is notified that proc1 is receiving - finishes the issend
  7. proc0 hits the ibarrier - and completes it
  8. proc1 is still working on the receive (meaning it's still looping, looking for additional messages and checking to see if the irecv is finished yet
  9. proc0 exits push()
  10. proc0 releases tag 1234 back into the pool because the function is over
  11. proc0 starts the next push()
  12. proc0 grabs 1234 again (because that's what's hardcoded for it to try to grab - and it's available again!)
  13. proc1 is still working on the receive...
  14. proc0 starts a new issend() using tag 1234 again
  15. proc1 (still in the first call to push()!) sees that send on a tag it's still looking for! It starts another irecv() to pull it in! Oh damn! Now proc1 has received two messages within the same call to push() when it was supposed to receive one message in two pushes()
  16. EVERYTHING DIES because proc1 is going to try to do something invalid with the second message...

Non-blocking land is a tricky beast! I've had my head in it now for nearly 3 years... and it' still tough to spot things like this!

So: this would all be fixed if the two push() calls in a row used different tags! If proc0 got 1235 for a tag in the second invocation of push()... then proc1 would totally ignore that message while it finished up receiving the first one... then it would exit the first push() and start the second one and do the right thing.

It should be noted again that the scheme that I've added here for managing tags is very closely related to what PETSc does... and I believe they probably do it for a very similar reason...

Also: there are other mitigation strategies (you could add true barrier() at the end of push() for instance)... but they are all slower. We want to allow proc0 to race ahead and start sending more messages... it might be sending to other processors who have already finished that first push() and they can start receiving... all while proc1 is still wrestling with that first message.

ALSO: note that this is not just an academic exercise. It's seriously easy to get this (I actually was!) if your messages are of a decent size and there are lot of senders...

@roystgnr
Copy link
Copy Markdown
Member

roystgnr commented Dec 3, 2018

Well, damn, that was a good answer.

And now I see the obvious - the old code was incrementing for ids already in use, then going back to the requested id as soon as it thought it could; the new code is incrementing for ids already in use but that's basically pointless because it stays incremented until wrapping around MAX_INT.

I still don't want to break anyone who decided to try a isend(tag=1); isend(tag=2); recv(tag=2); recv(tag=1) if we don't have to... but now I'm much less confident that we don't have to.

Could we make user tag requests an optional feature?

  • Make the get_unique_tag default value be MessageTag::invalid_tag
  • If we see invalid_tag we use _next_tag++ (or increment that further if that's already taken)
  • If we see a user-requested tag we use it (or increment it further if that's already taken)
  • Initialize _next_tag to 1B, so user requests are less likely to conflict with _next_tag
  • Make every libMesh internal get_unique_tag() call use the default, so even if we use those tags for asynchronous stuff we have fewer internal "user" requests likely to conflict with user requests.

int flag;
libmesh_call_mpi(MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &maxTag, &flag));

_max_tag = *maxTag;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be incorrect. On MPICH 3.2 I get -4026531841 this way, which doesn't fit in int and isn't even positive. The MPI_Comm_get_attr man page seems to think we need to be passing in a void* and then casting the pointer to int* before dereferencing.

I'll do that; I've got a fork of this branch in which I'm trying out my "see if I can retain backwards tagvalue compatibility without breaking Derek's stuff" plan.

// libmesh_assert_equal_to (tagvalue, maxval);
// #endif
used_tag_values[new_tag] = 1;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into SVN history to see why this was commented out, and apparently Ben had a use case in which it was expected to fail, because tags were being grabbed on a subset of processors. So I feel a little better about going to the effort of respecting tagvalue where it's used while avoiding using it where possible.

Actually, that two-part system ought to make sanity checking easier. Where the user requests an automatic tag value, we can be sure they want to keep _next_tag in sync, and we can restore this check.

@roystgnr
Copy link
Copy Markdown
Member

This algorithm made it in via #1965

@roystgnr roystgnr closed this Nov 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants