Skip to content

Re-implement parallel sync using the NBX algorithm#1965

Merged
roystgnr merged 25 commits into
libMesh:masterfrom
roystgnr:nbx_push_vectors
Mar 26, 2019
Merged

Re-implement parallel sync using the NBX algorithm#1965
roystgnr merged 25 commits into
libMesh:masterfrom
roystgnr:nbx_push_vectors

Conversation

@roystgnr
Copy link
Copy Markdown
Member

@roystgnr roystgnr commented Dec 4, 2018

This takes the algorithm from @friedmud in #1826, but fixes the MPI_TAG_UB bug, retains the ability to generate a MessageTag with a specified value, and adds a dbg/devel mode verification of automatic tag values when those are included.

This still needs a couple unit tests and I'm just getting started with my own private testing but I think it's in shape to throw at Civet and I'm done for today.

Also, we need to look at the tag choices in System::write_serialized_blocked_dof_objects - is it just me, or is that going to risk failure once num_blks exceeds 100, with only a few hundred million elements?

@roystgnr
Copy link
Copy Markdown
Member Author

roystgnr commented Dec 6, 2018

Spraying this thing with tests, but at this point if Civet decides it's happy then so am I. Not going to merge until @friedmud agrees too, of course.

@friedmud
Copy link
Copy Markdown
Member

@roystgnr sorry I dropped out on this - I've had my head buried in my dissertation for the last couple of weeks. Let me review this for a bit and get back to you...

friedmud and others added 11 commits January 24, 2019 16:51
Also adds nonblocking_barrier() and possibly_receive()
This is based on Derek's algorithm, but retains backwards
compatibility with manually selected tags.
The libmesh_call_mpi wrapper() should hide the actual MPI call,
and then initializing the int to 0 should make the --disable-mpi
behaviour "possibly_receive always returns false", which sounds
accurate enough.

Now I just need to test to see if this breaks horribly in the context
of the NBX parallel_sync.h when MPI is disabled.
Otherwise the linker gets confused when it can't find an
instantiation.
This first test just covers M->M push syncs.
This sort of error is a bit easier to debug with libMesh testing than
with my MPI stack's.
And refactor out a bit of repetitiveness.
And refactor a bit more to reduce the size of this rapidly growing
test collection
We assume the same round-robin rule here that we do with CheckpointIO,
so hopefully we can standardize the way we deal with N->M mesh
splitting
This looks like it'll be the best way to support N->M pull_parallel,
but it turns out to be easy and cheap and might be useful in its own
right.
@roystgnr
Copy link
Copy Markdown
Member Author

I'm guessing that Bison Node #427 specified in 'top_disp_r_fuel' not found in the mesh! in serial wasn't likely to be due to a change in parallel codes that pass all the other tests. Hopefully it'll have been fixed by this next push.

@roystgnr
Copy link
Copy Markdown
Member Author

roystgnr commented Feb 1, 2019

So, N->M pushes and pulls are now working properly in unit tests, which means there's a chance that this PR also fixes the underlying bug(s) behind #1950 and partially obviates #1815... but now that I'm moving on to try something more complicated than a unit test, I find I can't replicate the problem in #1950. At least not on less than a hundred thousand elements; I'm trying out half a million next.

In hindsight this shouldn't have surprised me, since the trigger for the bug must have been a hell of a corner case for us to not hit it sooner. @friedmud, can you send me a test case (MOOSE input file + mesh file(s)?) that goes nuts in either fashion? Something I can fit on 24 cores and 64GB would be ideal; preferably still something that triggers within 640 cores and 1280GB otherwise.

Even if we don't get that testing done right away or this turns out to be an insufficient fix, I'm increasingly confident that it's a solid improvement, but since it's also a major change IMHO we should still wait until after 1.4.0 branches before merging so our git users can shake it down before our release users get it.

@roystgnr
Copy link
Copy Markdown
Member Author

@jwpeterson recently got a new release out with the older more tested code, and @bboutkov recently reported some solid performance improvement results for this on libmesh-devel, and I recently wasted some time after accidentally triggering a bug that this branch fixes, so even if this branch isn't a fix-all I do think it's enough of an improvement to be merged and it's the ideal time to merge it. I'll do so shortly unless anyone screams.

@roystgnr roystgnr merged commit a43f79a into libMesh:master Mar 26, 2019
@roystgnr roystgnr deleted the nbx_push_vectors branch November 12, 2019 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants