Skip to content

GenericProjector rewrite#1938

Merged
roystgnr merged 20 commits into
libMesh:masterfrom
roystgnr:projection_refactor
Apr 12, 2019
Merged

GenericProjector rewrite#1938
roystgnr merged 20 commits into
libMesh:masterfrom
roystgnr:projection_refactor

Conversation

@roystgnr
Copy link
Copy Markdown
Member

I'm posting this at an early stage, both so it can get into CI (it's passing my local tests now but this is a huge change that could have all kinds of corner cases) and so @ajamar1 can try it out (it passes the test case in #1919 for me, at least), but anybody who isn't running the impacted transient-repartitioning-of-subdomain-restricted-variables case isn't going to want to use the current unbenchmarked bag of optimizations mixed with pessimizations in the first commits here. This is going to need more optimization work before it's ready to merge.

The goals:

Make projections faster by removing redundant calculations

Make projections correct even in the corner cases where subdomain-restricted
variables are on DoFs owned by processors which don't own that
subdomain.

The means:

Sort projection operations by vertices, edges, faces, interiors,
avoiding duplication.

Test ghost nodes' patches to see what their owners can compute, and
push our computed data to the owners if necessary, in between the
steps where it's computed and the steps where it's used.

The current problems:

Looping over vertices/edges/nodes rather than over elements is forcing
more context data initialization.

This initial commit is missing some of the common case optimizations that
were in the previous version; those need to be re-added.

@roystgnr
Copy link
Copy Markdown
Member Author

Those test failures might just be because one of the optimizations (interpolating rather than projecting non-vertex LAGRANGE DoFs) can also give slightly different results. I guess that's the optimization I'll add back in first.

@roystgnr
Copy link
Copy Markdown
Member Author

Now that I've started more extensive testing:

The good news: at least with complicated FE types this already seems to be ~25% faster than the previous code, so I'm now much more confident that even the simple cases will be faster after I restore those optimizations.

The bad news: this may pass most tests here, but there's still bugs. I haven't found any incorrect results yet, but at least one or two tests I've run hang.

@ajamar1
Copy link
Copy Markdown

ajamar1 commented Nov 20, 2018 via email

@roystgnr roystgnr force-pushed the projection_refactor branch from 9903231 to 57eddfb Compare December 3, 2018 19:17
@roystgnr roystgnr mentioned this pull request Jan 22, 2019
@roystgnr roystgnr force-pushed the projection_refactor branch 2 times, most recently from 436d315 to dd51747 Compare February 1, 2019 22:53
@moosebuild
Copy link
Copy Markdown

Job Test on dd51747 : invalidated by @roystgnr

Kicking; hoping that geomsearch/patch_update_strategy.never_warning timeout was a fluke

@roystgnr roystgnr force-pushed the projection_refactor branch from 1c39af5 to 21a66cb Compare March 26, 2019 21:07
@milljm
Copy link
Copy Markdown

milljm commented Mar 27, 2019

unrelated to this PR, @roystgnr would it help if I modified the civet recipe to not stop at the first failed target?

https://civet.inl.gov/job/311338/

@roystgnr
Copy link
Copy Markdown
Member Author

Nah, in general an early exit is ideal. Sometimes I find that I can't replicate a CI failure and then having subsequent failed cases to try is helpful, but that's very rare.

@roystgnr
Copy link
Copy Markdown
Member Author

Well that's interesting. In that initial_adapt test it's not the solution variables that look weird, it's the "box" variable that got set by the BoxMarker. In the gold standard it's 2 in the box and 1 everywhere else; in the broken output it's basically equal to elem->level() (2 in the box, 1 in the transitional layer, 0 everywhere else).

@roystgnr
Copy link
Copy Markdown
Member Author

fatal error: petsc.h: No such file or directory

Okay, I'm actually pretty sure this one wasn't my fault...

@moosebuild
Copy link
Copy Markdown

Job Test on 65f0e17 : invalidated by @roystgnr

Trying again, hopefully with PETSc back

@roystgnr
Copy link
Copy Markdown
Member Author

No, the 64bit and debug tests still can't find petsc.h. Anyone have any ideas?

Oh, and no GRINS tests yet, but that's just the usual "femputer is down again". @pbauman?

@roystgnr
Copy link
Copy Markdown
Member Author

I may have finally exhausted the bugs here, so if the plain Test CI passes then I'll do one last pass at interactive rebase cleanup and start throwing parallel sweeps at this thing.

@roystgnr
Copy link
Copy Markdown
Member Author

Nope, plain Test CI has the same petsc.h problem

@permcody
Copy link
Copy Markdown
Member

We are actively working with CIVET. We didn't just invalidate because we are trying to understand what was going on. Summary: We've recently added a "clean machine" target to every single event so that the build machines will ensure they have enough space before they run a job. Some of our targets require upwards of 40GBs of space starting from a fresh disk! This will give us a more reliable testing infrastructure moving forward. What this has exposed is obscure problems in our cloning/resetting logic. With nested submodules, it's actually pretty easy to confuse git when you "force delete" a parent submodule, which can cause bad things to happen downstream.

@permcody
Copy link
Copy Markdown
Member

Actually hang on... I see your failure. This has nothing to do with nested submodules... It looks like configure properly detected and configured for PETSc. Oh goodie, another problem to look into.

@milljm
Copy link
Copy Markdown

milljm commented Mar 28, 2019

I am not exactly sure how to fix this. Here is what is happening.

As civet launches, it reads a file which instructs it on which 'moose-environment' package to load. The file in question is called 'package_version'. Civet searches for this file in the following order:

if 
  $MOOSE_DIR/package_version
elif
  $REPO_DIR/moose/package_version
elif
  $REPO_DIR/package_version
else:
  use latest available version
fi

While libMesh does not have a package_version file sitting in the root of its repo, civet will use the latest available version. But later on in the civet recipe, we are doing something else. MOOSE is instructing civet to use a different set of compilers.

I need to devise a way, that once a recipe begins, it needs to 'stick' with that set of compilers, regardless what other repos might instruct civet to do.

@roystgnr
Copy link
Copy Markdown
Member Author

I am not exactly sure how to fix this.

Ping me when you come up with a fix or a workaround? I've got no shortage of other stuff I can be working on, but half of it will also hit the "needs MOOSE CI tests to make sure Roy's not being an idiot" stage sooner or later.

@milljm
Copy link
Copy Markdown

milljm commented Mar 28, 2019

A quick fix, is for you to add a file to the root of your repo, instructing civet to use the following package version:

echo "repo-hash:56178942a6f4cf47dfc6bb8b678d26bb6c5a32ed" > package_version

This will allow you to work on this PR today... While we try and solve this for tomorrow.

Note: I have a PR into moose which will update it's package_version file... So... unfortunately, once that goes in later tonight, this PR will start failing again... requiring you to update that package_version file to match what moose/package_version contains.

Referencing: idaholab/moose#13147

@milljm
Copy link
Copy Markdown

milljm commented Mar 28, 2019

@roystgnr hold off on adding package_version to your repo (if you were heading that direction)... we are close to figuring out a way to do this without a change on your end.

@roystgnr
Copy link
Copy Markdown
Member Author

This was passing the GRINS tests even before those last bugfixes, so I'll take that as gospel even if femputer isn't back up soon.

I'm working on unit testing that will improve the libMesh test coverage (I added special code paths to optimize some common MOOSE use cases but then half those weren't properly covered by libMesh test or example cases) and that will in particular cover the original bug report that got me into this mess, and I'll run some sweeps in CI here while I do so, but once that's all done this ought to be finally ready to merge.

@pbauman
Copy link
Copy Markdown
Member

pbauman commented Mar 28, 2019

Sorry, rebooted femputer, it's running now.

@roystgnr
Copy link
Copy Markdown
Member Author

roystgnr commented Apr 2, 2019

Gah. Both those segfaults are in the --enable-parmesh unit tests, but on my own machines I can sweep those from 1 proc through 24 procs, in debug mode even, with no problems.

roystgnr added 15 commits April 10, 2019 21:14
The goals:

Make this faster by removing redundant calculations

Make this correct even in the corner cases where subdomain-restricted
variables are on DoFs owned by processors which don't own that
subdomain.

The means:

Sort projection operations by vertices, edges, faces, interiors,
avoiding duplication.

Test ghost nodes' patches to see what their owners can compute, and
push our computed data to the owners if necessary, in between the
steps where it's computed and the steps where it's used.

The problems:

Looping over vertices/edges/nodes rather than over elements is forcing
more context data initialization.

This first step is missing some of the common case optimizations that
were in the previous version; those need to be re-added.
This fixes a unit test failure for me, hopefully fixes any code
further downstream that was relying on our previous
interpolation-vs-projection behavior, and ought to avoid us
pessimizing those cases.
We can't try to do mixed up MPI communication inside threads the way I
was before.
This is useful for getting indices selectively for copying during
projections.
This prevents us from having to do a sort or work with an unsorted
vector in the new GenericProjector code.
The optimization had been *half* added back, which left bugs as well
as pessimizations.

This code needs cleanup but it's hopefully correct now.
Counter-intuitively the extra "if" requirement *saves* a branch
instruction in some cases and doesn't add one in any case.
This should fail on some partitionings in a similar way to libMesh#1919 did,
and so should protect us a little from regressions in the
subdomain-restricted-variable support in GenericProjector
This saves a few lines of code and makes it a bit easier to understand
This test breaks (with logic errors) before the GenericProjector
refactor
This makes it easier to play with the --re option when debugging
tests, and any options that work well for our example codes are surely
going to work well enough for our unit tests too.
@roystgnr roystgnr force-pushed the projection_refactor branch from b3955ed to fab2cf0 Compare April 11, 2019 02:17
@roystgnr
Copy link
Copy Markdown
Member Author

I'd love to get more eyes on this first (@jwpeterson, @bboutkov, @ajamar1 ?), but at this point I think it's ready to merge assuming CI is still happy with the last swath of new cppunit assertions. Most of the recent failures in the new test coverage turned out to be bugs in master too, and at least in my own runs this PR is now passing all the new testing with flying colors (which is more than I can say for master, even after the first rounds of fixes, because the third fix would be harder to backport).

This makes testing easier and might help other use cases too.
If we have subdomain-restricted variables, and we ask to evaluate one
on a boundary between elements that support the variable and elements
that do not, obviously we never want to try evaluating on one of the
latter.
@roystgnr
Copy link
Copy Markdown
Member Author

Ok, now I've got test coverage that seems to fail reliably on master in the same way as #1919 did and that works reliably on this PR for me, so hopefully we can avoid any future regressions in the bug that started this whole mess! If CI is still happy too then I'll merge.

@bboutkov
Copy link
Copy Markdown
Member

So I gave these changes (without the last 5 commits) a whirl yesterday, and the overall results are looking good. First of all, in terms of correctness, I was not able to produce any failures after testing various configurations targeting GMG + extensive use of projection_matrix(), so looks good to me on that end. Second, I really like the the logging improvements for projection related calls, it looks so much cleaner now, so thanks for those changes!

I also took a look at overall weak scaling timings for GMG on Poisson and how I was affected by this PR. For overall timing, these changes shave off ~0.5s off each run up to np256, so that seems to be in line with the cleanup. Since I didn't see too much overall change I also looked at the timing of the GenericProjector itself and noticed a bit of a curiosity (see plot below). While the speedup was fairly consistent up to np64, afterwards it wasn't as noticeable. I wasn't quite able to track down any particular reason for this, partly because these timings are subject to about as much variability from run to run as the difference between the pre vs post changes, but after a few tries I was able to consistently replicate this difference at np256. Perhaps its OK, since after all these timings are less than a couple of seconds for 55k local elem for the big run, but I figured I would make the note since I wasn't able to get any higher proc count runs in time to ensure this issue doesn't get even worse at higher proc counts.

gen_proj_proj

@roystgnr
Copy link
Copy Markdown
Member Author

Thanks for that!

The last five commits are all new tests, new features for use in new tests, or fixes for new tests, so shouldn't affect any of your results.

What's nasty here is that I'm afraid we expect some loss of scalability, because the fix for #1919 necessarily entailed communication steps that didn't exist in the original algorithm; better to get the right answer slowly than to get the wrong answer quickly. I was hoping for the crossover point to be in the tens of thousands of processors, though, not in the hundreds.

We might be able to mitigate the problem slightly (and improve some other areas) by enabling more overlap of communication and calculation. #1965 was actually a tiny regression in that respect (though utterly worth it for the asymptotic speedup in the common case), so we should have some room to improve, and I've been thinking about how.

Oh, and wait a minute - those timings of yours are for the GMG projection matrix constructions, right? I'm afraid one of the sends for that case doesn't use the new algorithm from #1965, because that currently is only available when sending vectors of fixed-length types, whereas the projection matrix rows have to get sent as vectors-of-vectors, which falls back on my intermediate alltoall-limited algorithm, which means we expect a serious loss of scalability. So the good news from most users' perspective is that you're kind of the worst-case scenario here, but the bad news from your perspective is you're kind of the worst-case scenario here! I apologize! If you have any plans to do GMG on thousands of cores in the near future, let me know, and I'll bump that optimization up higher on my todo list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants