Skip partitioning in copy_nodes_and_elems if we should by friedmud · Pull Request #1815 · libMesh/libmesh

friedmud · 2018-07-27T19:36:02Z

Huge optimization for stitching meshes...

MeshTools::correct_node_proc_ids() was taking a HUGE amount of time... and is completely unnecessary in this situation.

roystgnr

Great catch

pbauman · 2018-07-27T21:37:31Z

This seems to have affected the adjoint QoI error estimation, but only in parallel; serial runs are fine. It's the same, and only, failure on both recipes.

==========================================================
Estimating error
==========================================================
The flux QoI 0 is: 3.3294760746586462e+01

Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 158
final_local_dof_bearing_nodes = 154


Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 184
final_local_dof_bearing_nodes = 188


[2] /femputer/pbauman/civet_build_testing_root/libMesh/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Jul 27 2018 at 15:39:39
[3] /femputer/pbauman/civet_build_testing_root/libMesh/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Jul 27 2018 at 15:39:39
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 60351 RUNNING AT femputer.eng.buffalo.edu
=   EXIT CODE: 1
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
FAIL regression/extra_quadrature_order_laplace_arefee_amr.sh (exit status: 1)

roystgnr · 2018-07-27T23:55:44Z

Damn, I'm glad that got caught, but I don't see how this could be triggering it. Basically that assertion says "we uniformly refined and then coarsened again and somehow that changed our DoF numbering", but:

- That's the sort of error that would happen from accidentally turning on repartitioning, not turning it off.
- Since when does that code use copy_nodes_and_elems?

Sorry I don't have time to debug this.

roystgnr · 2018-07-27T23:56:45Z

Well, wait, we're not turning off repartitioning, we're turning off correct_node_proc_ids, and that makes a difference. But still, we shouldn't even be coming near that code path in ARefEE.

friedmud · 2018-08-03T18:07:44Z

This is also causing problems with splitting meshes with certain partitioners... because without this fix there is no way to keep those partitioners from running during prepare_for_use (which will make them partition with the wrong number of partitions).

Can we get to the bottom of this so we can merge this? This PR really does seem like it should be the correct behavior... and it feels like whatever is breaking must be depending on a side-effect.

pbauman · 2018-08-03T18:23:48Z

> Can we get to the bottom of this so we can merge this?

@roystgnr is out for a while and I'm swamped with other deadlines, but I'll at least try to get a more in-depth stack trace. adjoints_ex6 didn't seem to trigger this failure, so I can also try to cook up a libMesh-only example that does trigger it. I can't do much more than that for at least the next several weeks.

> This PR really does seem like it should be the correct behavior... and it feels like whatever is breaking must be depending on a side-effect.

I agree.

pbauman · 2018-08-03T18:28:39Z

Oh wait, it only breaks in parallel and I didn't run make check in parallel on femputer, so adjoints_ex6 might still be broken if we run it with more than 1 processor. Let's check that first.

pbauman · 2018-08-03T18:59:11Z

Dammit. I can't get adjoints_ex6 to break standalone. I need to take a closer look at the GRINS failing case.

pbauman · 2018-08-03T19:36:50Z

OK, I can get adjoints_ex6 to break!

Apply this patch:

[03:33:55][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/pbauman/examples/adjoints/adjoints_ex6][skip_part_in_copy_nodes_elems] $ git diff
diff --git a/examples/adjoints/adjoints_ex6/adjoints_ex6.C b/examples/adjoints/adjoints_ex6/adjoints_ex6.C
index ed662aa..4824f50 100644
--- a/examples/adjoints/adjoints_ex6/adjoints_ex6.C
+++ b/examples/adjoints/adjoints_ex6/adjoints_ex6.C
@@ -196,7 +196,7 @@ std::unique_ptr build_adjoint_refinement_error_estim
   adjoint_refinement_estimator->qoi_set() = qois;
 
   // We enrich the FE space for the dual problem by doing 2 uniform h refinements
-  adjoint_refinement_estimator->number_h_refinements = 2;
+  adjoint_refinement_estimator->number_h_refinements = 1;
 
   return std::unique_ptr(adjoint_refinement_estimator);
 }
diff --git a/examples/adjoints/adjoints_ex6/general.in b/examples/adjoints/adjoints_ex6/general.in
index 034b3be..aee9709 100644
--- a/examples/adjoints/adjoints_ex6/general.in
+++ b/examples/adjoints/adjoints_ex6/general.in
@@ -45,7 +45,7 @@ nelem_target = 40000
 global_tolerance = 0
 
 # Are we doing uniform refinement steps
-refine_uniformly = true
+refine_uniformly = false 
 
 # Max number of refinements at each step
 refine_fraction = 0.05
@@ -54,10 +54,10 @@ refine_fraction = 0.05
 coarsen_fraction = 0.0
 
 # Coarsen threshold factor for refinement trading
-coarsen_threshold = 0
+coarsen_threshold = 0.1
 
 # The maximum number of adaptive steps per timestep
-max_adaptivesteps = 2
+max_adaptivesteps =  3 
 
 # Use what finite element space?
 fe_family = LAGRANGE
[03:33:56][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/pbauman/examples/adjoints/adjoints_ex6][skip_part_in_copy_nodes_elems] $

Run with two processors and I get the following output:

[03:33:05][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/install/skip_part_in_copy_nodes_elems/examples/adjoints/ex6]$ mpiexec -np 2 ./example-devel 
Started ./example-devel
Reading in and building the mesh
Building system
Initializing systems
 Mesh Information:
  elem_dimensions()={2}
  spatial_dimension()=2
  n_nodes()=81
    n_local_nodes()=45
  n_elem()=20
    n_local_elem()=10
    n_active_elem()=16
  n_subdomains()=1
  n_partitions()=2
  n_processors()=2
  n_threads()=1
  processor_id()=0

 EquationSystems
  n_systems()=1
   System #0, "PoissonSystem"
    Type "Implicit"
    Variables="T" 
    Finite Element Types="LAGRANGE", "JACOBI_20_00" 
    Infinite Element Mapping="CARTESIAN" 
    Approximation Orders="SECOND", "THIRD" 
    n_dofs()=81
    n_local_dofs()=45
    n_constrained_dofs()=32
    n_local_constrained_dofs()=17
    n_vectors()=2
    n_matrices()=1
    DofMap Sparsity
      Average  On-Processor Bandwidth <= 12.2222
      Average Off-Processor Bandwidth <= 1.62963
      Maximum  On-Processor Bandwidth <= 25
      Maximum Off-Processor Bandwidth <= 10
    DofMap Constraints
      Number of DoF Constraints = 32
      Average DoF Constraint Length= 0
      Number of Node Constraints = 0

  Nonlinear solver converged, step 0, residual reduction 9.99412e-11 < 1e-09
Adaptive step 0, we have 16 active elements and 49 active dofs.
Postprocessing: 
The computed QoI 0 is 36.230655305116564
The relative error in QoI 0 is 2.9365219718165605
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -36.230655307209972

The computed relative error in QoI 0 is 0.9275882192584336
The effectivity index for the computed error in QoI 0 is 0.31587988380846976
Refined mesh to 19 active elements and 57 active dofs.
  Nonlinear solver converged, step 0, residual reduction 4.2366907426593718e-11 < 1.0000000000000001e-09
Adaptive step 1, we have 19 active elements and 57 active dofs.
Postprocessing: 
The computed QoI 0 is 36.679825405894334
The relative error in QoI 0 is 3.3856920725943311
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -36.679825405637239

The computed relative error in QoI 0 is 0.86476474532195002
The effectivity index for the computed error in QoI 0 is 0.25541742331555645
Refined mesh to 22 active elements and 67 active dofs.
  Nonlinear solver converged, step 0, residual reduction 5.6987413243126504e-11 < 1.0000000000000001e-09
Adaptive step 2, we have 22 active elements and 67 active dofs.
Postprocessing: 
The computed QoI 0 is 37.000077436935747
The relative error in QoI 0 is 3.7059441036357441
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -37.008822648352393

The computed relative error in QoI 0 is 2.5286503300307319
The effectivity index for the computed error in QoI 0 is 0.68232284657234321
Refined mesh to 34 active elements and 107 active dofs.
  Nonlinear solver converged, step 0, residual reduction 3.3895969456915461e-11 < 1.0000000000000001e-09
Adaptive step 3, we have 34 active elements and 107 active dofs.
Postprocessing: 
The computed QoI 0 is 37.004887672501141
The relative error in QoI 0 is 3.710754339201138
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -36.986539654481803

Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 91
final_local_dof_bearing_nodes = 93


Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 76
final_local_dof_bearing_nodes = 74


[1] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug  3 2018 at 14:33:53
[0] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug  3 2018 at 14:33:53
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

pbauman · 2018-08-03T19:41:17Z

Actually I don't even have to change adjoints_ex6.C. The following patch triggers failure on 2 processors:

[03:40:23][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/pbauman/examples/adjoints/adjoints_ex6][skip_part_in_copy_nodes_elems] $ git diff
diff --git a/examples/adjoints/adjoints_ex6/general.in b/examples/adjoints/adjoints_ex6/general.in
index 034b3be..85852e8 100644
--- a/examples/adjoints/adjoints_ex6/general.in
+++ b/examples/adjoints/adjoints_ex6/general.in
@@ -45,7 +45,7 @@ nelem_target = 40000
 global_tolerance = 0
 
 # Are we doing uniform refinement steps
-refine_uniformly = true
+refine_uniformly = false 
 
 # Max number of refinements at each step
 refine_fraction = 0.05
@@ -54,10 +54,10 @@ refine_fraction = 0.05
 coarsen_fraction = 0.0
 
 # Coarsen threshold factor for refinement trading
-coarsen_threshold = 0
+coarsen_threshold = 0.1
 
 # The maximum number of adaptive steps per timestep
-max_adaptivesteps = 2
+max_adaptivesteps = 4 
 
 # Use what finite element space?
 fe_family = LAGRANGE

pbauman · 2018-08-03T19:42:39Z

It's similar on the GRINS side. It takes several adaptive steps, and it can happen with either 1 or 2 h refinements in the error estimator. In the adjoints_ex6 case, with 2 h refinements in the error estimator, it took one more adaptive step to trigger the assert.

pbauman · 2018-08-03T19:52:41Z

For posterity, here's the output with the last patch I posted.

[03:40:10][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/install/skip_part_in_copy_nodes_elems/examples/adjoints/ex6]$ mpiexec -np 2 ./example-devel 
Started ./example-devel
Reading in and building the mesh
Building system
Initializing systems
 Mesh Information:
  elem_dimensions()={2}
  spatial_dimension()=2
  n_nodes()=81
    n_local_nodes()=45
  n_elem()=20
    n_local_elem()=10
    n_active_elem()=16
  n_subdomains()=1
  n_partitions()=2
  n_processors()=2
  n_threads()=1
  processor_id()=0

 EquationSystems
  n_systems()=1
   System #0, "PoissonSystem"
    Type "Implicit"
    Variables="T" 
    Finite Element Types="LAGRANGE", "JACOBI_20_00" 
    Infinite Element Mapping="CARTESIAN" 
    Approximation Orders="SECOND", "THIRD" 
    n_dofs()=81
    n_local_dofs()=45
    n_constrained_dofs()=32
    n_local_constrained_dofs()=17
    n_vectors()=2
    n_matrices()=1
    DofMap Sparsity
      Average  On-Processor Bandwidth <= 12.2222
      Average Off-Processor Bandwidth <= 1.62963
      Maximum  On-Processor Bandwidth <= 25
      Maximum Off-Processor Bandwidth <= 10
    DofMap Constraints
      Number of DoF Constraints = 32
      Average DoF Constraint Length= 0
      Number of Node Constraints = 0

  Nonlinear solver converged, step 0, residual reduction 9.99412e-11 < 1e-09
Adaptive step 0, we have 16 active elements and 49 active dofs.
Postprocessing: 
The computed QoI 0 is 36.230655305116564
The relative error in QoI 0 is 2.9365219718165605
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -36.230655307209972

The computed relative error in QoI 0 is 2.4147326937047575
The effectivity index for the computed error in QoI 0 is 0.8223104464670431
Refined mesh to 19 active elements and 57 active dofs.
  Nonlinear solver converged, step 0, residual reduction 4.2366907426593718e-11 < 1.0000000000000001e-09
Adaptive step 1, we have 19 active elements and 57 active dofs.
Postprocessing: 
The computed QoI 0 is 36.679825405894334
The relative error in QoI 0 is 3.3856920725943311
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -36.679825405637239

The computed relative error in QoI 0 is 3.0694618459445815
The effectivity index for the computed error in QoI 0 is 0.90659805443929986
Refined mesh to 22 active elements and 67 active dofs.
  Nonlinear solver converged, step 0, residual reduction 5.6987413243126504e-11 < 1.0000000000000001e-09
Adaptive step 2, we have 22 active elements and 67 active dofs.
Postprocessing: 
The computed QoI 0 is 37.000077436935747
The relative error in QoI 0 is 3.7059441036357441
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -37.008822648352393

The computed relative error in QoI 0 is 3.6688957314081927
The effectivity index for the computed error in QoI 0 is 0.99000298677165566
Refined mesh to 28 active elements and 85 active dofs.
  Nonlinear solver converged, step 0, residual reduction 9.2604373170744638e-11 < 1.0000000000000001e-09
Adaptive step 3, we have 28 active elements and 85 active dofs.
Postprocessing: 
The computed QoI 0 is 35.413841394123793
The relative error in QoI 0 is 2.1197080608237897
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -35.435965766674194

The computed relative error in QoI 0 is 2.0922047312006988
The effectivity index for the computed error in QoI 0 is 0.98702494455184453
Refined mesh to 34 active elements and 105 active dofs.
  Nonlinear solver converged, step 0, residual reduction 7.1958162095113093e-11 < 1.0000000000000001e-09
Adaptive step 4, we have 34 active elements and 105 active dofs.
Postprocessing: 
The computed QoI 0 is 34.979863391304818
The relative error in QoI 0 is 1.6857300580048147
Computing the error estimate using the Adjoint Refinement Error Estimator

The flux QoI 0 is: -34.975506469825348

Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 90
final_local_dof_bearing_nodes = 91


Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 79
final_local_dof_bearing_nodes = 78


[0] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug  3 2018 at 14:33:53
[1] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug  3 2018 at 14:33:53

friedmud · 2018-08-03T22:14:11Z

Thanks Paul - it's so weird... are those cases even explicitly turning off partitioning?


pbauman · 2018-08-03T22:19:37Z

> Thanks Paul - it's so weird...

Agreed, this is really weird.

> are those cases even explicitly turning off partitioning?

Certainly not on the GRINS side; I would need to read adjoints_ex6 more carefully to confirm, but I doubt it's being set in there either. We need to carefully go through AdjointRefinementEstimator::estimate_error(); that function does quite a bit (it's about 500 lines long...). Whatever is triggering this code path has to be in there.

pbauman · 2018-08-03T22:27:49Z

So it looks like partitioning is getting disabled because the partitioner gets detached during the refinement + compute error estimator + coarsen step and then restored at the end. But it looks like we don't want repartitioning in there.
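The detach-and-restore pattern described in this comment can be sketched as a small RAII guard. This is a self-contained toy and not libMesh's code: `ToyMesh`, `ToyPartitioner`, and `PartitionerDetachGuard` are hypothetical names invented purely for illustration.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-ins; these are not libMesh classes.
struct ToyPartitioner { int n_calls = 0; };

struct ToyMesh
{
  std::unique_ptr<ToyPartitioner> partitioner = std::make_unique<ToyPartitioner>();

  // With no partitioner attached, repartitioning is a no-op.
  void repartition() { if (partitioner) ++partitioner->n_calls; }
};

// Detaches the partitioner on construction and restores it on destruction,
// so nothing inside the guarded scope can repartition the mesh.
class PartitionerDetachGuard
{
public:
  explicit PartitionerDetachGuard(ToyMesh & mesh)
    : _mesh(mesh), _saved(std::move(mesh.partitioner)) {}

  ~PartitionerDetachGuard()
  {
    // Nobody should have attached a different partitioner in the meantime.
    assert(!_mesh.partitioner);
    _mesh.partitioner = std::move(_saved);
  }

  PartitionerDetachGuard(const PartitionerDetachGuard &) = delete;
  PartitionerDetachGuard & operator=(const PartitionerDetachGuard &) = delete;

private:
  ToyMesh & _mesh;
  std::unique_ptr<ToyPartitioner> _saved;
};

// Mimics refine -> estimate -> coarsen; reports whether a partitioner
// was attached while the guarded scope was active.
inline bool partitioner_attached_during_estimate(ToyMesh & mesh)
{
  PartitionerDetachGuard guard(mesh);
  mesh.repartition(); // refinement step: no partitioner, so nothing happens
  mesh.repartition(); // coarsening step: likewise
  return static_cast<bool>(mesh.partitioner);
}
```

In this toy, calling `partitioner_attached_during_estimate(mesh)` returns `false`, and once it returns the original partitioner is back on the mesh with its call count untouched.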

pbauman · 2018-08-03T22:29:30Z

I should say I'm intimately familiar with the theory, but @roystgnr and @vikramvgarg are the ones who wrote this code, so they'll be more familiar with the implementation here.

pbauman · 2018-08-03T22:31:23Z

Commit 9b5553d looks highly relevant. Perhaps that will trigger a memory for @roystgnr?

roystgnr · 2018-08-04T00:59:58Z

That commit 9b5553d is basically a complicated assertion - if you were to take it out then the ARefEE-using code would still break in every case, but the breakage would be slightly later and greatly more confusing.

We definitely don't want repartitioning during ARefEE (or URefEE), 1% for efficiency's sake and 99% because we're caching an arbitrary number of coarse vectors that we hope to just straight-copy back into place in a system that has been restored to have the exact same partitioning and DoF number ordering that it started with.

I've got like 10 minutes more before family-vacation-duties trump you guys, so I'm afraid I don't have time to look into this deeply.

Paul, if you're jiggering refinement settings around, is there any chance you can get the same failure to trigger without this PR? There's some complicated code here and it's not impossible that it's currently broken in master but only triggering now.

But I don't think that's the case. I think what's happening is that when doing coarsening of a second-order-element AMR mesh we may need to correct node proc ids, due to one of those crazy cases on hanging nodes. Picture a coarse QUAD9 on proc 0 next to 4 refined QUAD9s on proc 1. There are 5 nodes on the shared coarse edge, and under our default node partitioning three of them belong to proc 0 (because they're part of the coarse quad) and two (the midedge nodes on the refined quads) belong to proc 1.

Refine this, and now suddenly the two nodes from proc 1 canonically belong to proc 0, and we'll assign them accordingly.

Coarsen this again, and if you want the node partitioning to match the original node partitioning again (as we demand here) then you have to correct_node_proc_ids, even if we haven't repartitioned any elements!
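The QUAD9 scenario above can be replayed with a toy model of the five nodes on the shared edge. This is only a sketch under stated assumptions: `ToyElem`, `canonical_owners`, and `nodes_needing_correction` are hypothetical names, and the rule "a node belongs to the lowest rank among the active elements touching it" is assumed here because it reproduces the 3/2 split described in the comment. Only the two fine quads that touch the shared edge are modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Toy element: an owning rank plus the shared-edge nodes it touches.
// Edge nodes are numbered 0..4 from one end of the edge to the other.
struct ToyElem
{
  int proc;
  std::vector<int> edge_nodes;
};

// Assumed rule for this sketch: a node belongs to the lowest rank among
// the active elements that touch it.
std::map<int, int> canonical_owners(const std::vector<ToyElem> & active)
{
  std::map<int, int> owner;
  for (const ToyElem & e : active)
    for (int n : e.edge_nodes)
      {
        auto it = owner.find(n);
        if (it == owner.end())
          owner[n] = e.proc;
        else
          it->second = std::min(it->second, e.proc);
      }
  return owner;
}

// Coarse quad on rank 0 next to two fine quads on rank 1.
std::vector<ToyElem> coarse_state()
{
  return {{0, {0, 2, 4}},   // coarse quad: end vertices plus its mid-edge node
          {1, {0, 1, 2}},   // fine quad
          {1, {2, 3, 4}}};  // fine quad
}

// After refining the coarse quad, its children touch all five edge nodes.
std::vector<ToyElem> refined_state()
{
  return {{0, {0, 1, 2}},   // child of the former coarse quad
          {0, {2, 3, 4}},   // child of the former coarse quad
          {1, {0, 1, 2}},
          {1, {2, 3, 4}}};
}

// Nodes whose stale owner disagrees with the owner the current mesh implies.
std::vector<int> nodes_needing_correction(const std::map<int, int> & stale,
                                          const std::map<int, int> & canonical)
{
  std::vector<int> out;
  for (const auto & kv : canonical)
    {
      auto it = stale.find(kv.first);
      if (it != stale.end() && it->second != kv.second)
        out.push_back(kv.first);
    }
  return out;
}
```

Feeding the owners computed from `refined_state()` as the stale assignment and the owners from `coarse_state()` as the canonical one flags exactly the two fine mid-edge nodes, even though no element changed rank.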

I'm not sure what the best way is to make sure that correction happens in such cases but does not happen during mesh stitching.

roystgnr · 2018-08-04T01:13:36Z

Hmm... or perhaps the way to hit things is to go further, and make sure that those midpoint nodes are never reassigned to proc 0 in the first place? That's probably the best long-term fix even if it's not the easiest or quickest short-term fix.

Anyway, AFK for a while.

roystgnr · 2018-08-06T07:46:23Z

I don't think I'll find enough time to fix this soon but I can at least replicate it and I understand it a bit better.

It's not just a refinement-based-error-estimator problem; those are just the first code paths to catch an effect. In any mesh coarsening, there's the possibility that we may have a node whose processor id was previously consistent with the elements sharing it but now needs to change after the last element(s) sharing that id have vanished.

And from there the trouble snowballs: on a distributed mesh, when a node needs to be changed for this reason, there may be any number of other processors on which that node is ghosted but on which not enough of the elements sharing it have been ghosted for us to determine the problem.

The current "send queries for all ghost nodes" code might be greatly improved upon, perhaps? If we somehow did "send queries for all ghost nodes which might be candidates for a remote_elem-coarsening-forced pid change, and compute locally on any others" that would be a ton less communication. (in particular it would mean no communication at all on serialized meshes) The trouble is determining (in the distributed mesh case) which nodes are such candidates. Right now we even support meshes like domains that self-intersect at a single vertex with no neighbor links between the two sides sharing that vertex, but with no such links available the vertex owner doesn't necessarily even know that other processors have elements using it too! Even without crazy domains, supporting arbitrary ghosting patterns means that a node owner doesn't know what other processors have it ghosted.

As a first step, it would be possible to not send queries for any nodes that are definitely still consistent, when repartitioning is off. If I find any time to code I'll try that out.
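A minimal sketch of that first step, assuming "definitely still consistent" means that at least one locally visible element touching the node is still owned by the node's current processor id. `GhostNode` and `nodes_to_query` are hypothetical names for illustration, not libMesh API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy ghost node: its currently assigned owner plus the owners of the
// locally visible elements that touch it.
struct GhostNode
{
  int id;
  int proc;
  std::vector<int> local_elem_procs;
};

// Returns the ids of nodes that still need a query. A node is skipped only
// when local data proves its owner still has an element touching it; lack
// of local proof is not proof of a problem, so such nodes are queried.
std::vector<int> nodes_to_query(const std::vector<GhostNode> & ghosts)
{
  std::vector<int> query;
  for (const GhostNode & n : ghosts)
    {
      const bool consistent =
        std::find(n.local_elem_procs.begin(), n.local_elem_procs.end(), n.proc)
        != n.local_elem_procs.end();
      if (!consistent)
        query.push_back(n.id);
    }
  return query;
}
```

The filter is deliberately one-sided: a node with no local evidence may still be fine on its owner, which is why it is queried rather than reassigned locally.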

friedmud · 2018-09-13T23:13:57Z

@roystgnr any movement on this? It's holding up idaholab/moose#11953

friedmud · 2018-09-13T23:14:09Z

Is there even something simple I can do for now?

friedmud · 2018-09-13T23:43:38Z

Maybe in the cases where you need this behavior we could just call MeshTools::correct_node_proc_ids() manually?

roystgnr · 2018-09-14T14:39:22Z

Wait, this is a correctness issue, not just a performance optimization? I'll try and give it another look today.

friedmud · 2018-09-16T16:47:45Z

@roystgnr yeah, it is a correctness problem.

The issue is that when doing n->m mesh splitting you don't want the mesh to be partitioned during reading/generation because it is going to be partitioned by the splitting process after that - and some partitioners only work on certain numbers of processors - like the GridPartitioner ( http://mooseframework.org/moose/source/partitioner/GridPartitioner.html ).

Without this fix, partitioning is never truly turned off, which generates errors with partitioners like that.
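To make the n->m problem concrete, here is a toy grid partitioner that, by assumption, accepts only a partition count equal to its number of grid cells. `ToyGridPartitioner` and `can_partition` are hypothetical and are not the MOOSE GridPartitioner implementation.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical grid partitioner: one partition per grid cell.
struct ToyGridPartitioner
{
  unsigned nx, ny, nz;

  // Throws unless the grid has exactly one cell per requested partition.
  void check(unsigned n_parts) const
  {
    if (nx * ny * nz != n_parts)
      throw std::invalid_argument("grid cells != number of partitions");
  }

  // Partition id for grid cell (i, j, k), with i varying fastest.
  unsigned cell_to_part(unsigned i, unsigned j, unsigned k) const
  {
    assert(i < nx && j < ny && k < nz);
    return i + nx * (j + ny * k);
  }
};

// True if the partitioner could run with this many partitions.
inline bool can_partition(const ToyGridPartitioner & p, unsigned n_parts)
{
  try
    {
      p.check(n_parts);
      return true;
    }
  catch (const std::invalid_argument &)
    {
      return false;
    }
}
```

With a 2x2x2 grid meant for an 8-way split, a mesh that is read on 2 processors cannot be partitioned at read time (`can_partition(p, 2)` is false), while the later split into 8 pieces works; any partitioning pass that sneaks in during reading hits the first case.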

roystgnr · 2018-12-10T14:50:19Z

Note to self so I don't waste time again later: none of the above breakage replicates with my default --enable-parmesh configuration; I have to be running with a default ReplicatedMesh to trigger it.

(That isn't because DistributedMesh doesn't fall victim to the same problems, but if we want to trigger those problems with a different partitioning we're going to need different settings to get the exact same corner case to show up.)

roystgnr · 2018-12-20T15:32:58Z

So... the underlying problem here isn't node repartitioning, it's the utter failure of our communications algorithms (both old and new) to do the right thing in n->m situations, when my assumption that a DofObject with processor_id m corresponds to an actual processor m is completely wrong. We're actually already carefully avoiding unnecessary node repartitioning in the skip_repartitioning code path, and have been since b5a39e0 in #1659, which limits exceptions to only nodes whose prior assignment has been invalidated by adaptive coarsening. But the mere act of trying to communicate which nodes have been so orphaned triggers the communication bugs, just as in #1950.

Let me see if I can set up a test replicating the problem case(s) first, then I'll try to base a fix on top of #1965.

friedmud · 2019-03-06T04:13:54Z

This is still important - regardless of communication algorithm and n->m stuff.

If we want to skip partitioning - then... it should be skipped!

I'm back in here commenting about this because I noticed, once again, that copy_nodes_and_elements() is incredibly slow. This time because of mesh stitching (look for another PR soon to try to speed up mesh stitching by doing less prepare_for_use() during stitching... but this PR is also important).

In my opinion - any code that relies on partitioning happening even though we've explicitly told libMesh not to partition... is broken code that should be removed until it can be refactored (or another workaround is found).

When building large meshes (100M+ elements) with a lot of stitching this one thing can cost hours.

roystgnr · 2019-03-06T14:36:59Z

> In my opinion - any code that relies on partitioning happening even though we've explicitly told libMesh not to partition... is broken code that should be removed until it can be refactored (or another workaround is found).

You want libMesh to support nodes that aren't connected to any element owned by their owner? If we're going to refactor half the library then I'd rather get a benefit like "anisotropic refinement", not a benefit like "we no longer have to reassign inconsistently partitioned nodes when coarsening".

dschwen · 2019-03-06T15:11:48Z

+1 on "anisotropic refinement" ;-)


pbauman · 2019-03-06T15:21:56Z

> +1 on "anisotropic refinement" ;-)

I see my comments at the SIAM CSE tutorial, along the lines of that "being a huge amount of work", haven't dissuaded @dschwen.

friedmud · 2019-03-06T15:42:36Z

@roystgnr I don't understand your statement - what does that have to do with turning off partitioning?

If someone has turned off partitioning - then it should be off. Just because there is a workflow where if partitioning is turned off then bad stuff happens... that doesn't mean I shouldn't be able to turn it off and have that actually mean that it's off.

I feel like we're increasingly having these types of discussions. What is the role of libMesh? Is it to hold your hand and make sure nothing bad can ever happen... or is it to be flexible and do what it has been prescribed to do. To me: if I tell a piece of software that I want to do something (like NOT partition)... and it decides to do it anyway... that's a BUG.

What do we have skip_partitioning() for if it's going to be ignored?!

I understand that this breaks one small use-case. But that small use-case should have never been dependent on this anyway... and a workaround should be made in that use-case. Not keep something fundamental (turning off partitioning) broken for everyone else.

roystgnr · 2019-03-06T16:03:40Z

> What do we have skip_partitioning() for if it's going to be ignored?!

To skip partitioning of elements and validly partitioned nodes.

> I understand that this breaks one small use-case.

Calling parallel coarsening a "small use-case" is ridiculous.

> But that small use-case should have never been dependent on this anyway...

I agree, but that's where we were. The alternatives then were "add a skip_partitioning() option that has a workaround" or "don't add a skip_partitioning() option".

> and a workaround should be made in that use-case.

So, something like, "We're actually already carefully avoiding unnecessary node repartitioning in the skip_repartitioning code path, and have been since b5a39e0 in #1659, which limits exceptions to only nodes whose prior assignment has been invalidated by adaptive coarsening." perhaps? The rest of that comment would probably be helpful for you too.

friedmud · 2019-03-06T16:34:54Z

Calling parallel coarsening a "small use-case" is ridiculous.

The "small use-case" is: calling parallel coarsening with skip_partitioning = true.

THAT's a small use-case... and is obviously just an invalid thing to do. But instead of making skip_partitioning just not do what it's supposed to do... you just shouldn't skip_partitioning if you're doing refinement / coarsening.

You're still dodging the main question: if I set skip_partitioning(true) do you think it's ok for the software to still do partitioning?

That's the only question here.

friedmud · 2019-03-06T16:38:33Z

So: my proposal is this:

Disable the GRINS tests that are relying on the incorrect behavior of skip_partitioning().
Merge this
Work to call partition() directly within the GRINS code where it needs to be called (or tell it not to skip partitioning at the right time) and re-enable the tests.
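One way to read the third step ("tell it not to skip partitioning at the right time") is a scoped override of the skip flag around the code that needs partitioning. The sketch below is only an illustration under stated assumptions: `FlagMesh`, `ScopedSkipPartitioning`, and `coarsen_with_repartition` are hypothetical names, and the toy mesh merely mimics a `skip_partitioning()` setter/getter pair and a `partition()` call. It is not the GRINS or libMesh code.

```cpp
#include <cassert>

// Hypothetical stand-in for a mesh; only the flag accessors and
// partition() matter for this sketch.
struct FlagMesh {
  bool skip_flag = false;
  int partition_calls = 0;

  void skip_partitioning(bool skip) { skip_flag = skip; }
  bool skip_partitioning() const { return skip_flag; }

  // Honors the flag unconditionally: does nothing when skipping is on.
  void partition() {
    if (!skip_flag)
      ++partition_calls;
  }
};

// Hypothetical RAII guard: sets the skip flag for the current scope and
// restores the caller's previous setting on exit.
template <typename Mesh>
class ScopedSkipPartitioning {
public:
  ScopedSkipPartitioning(Mesh & mesh, bool skip)
    : _mesh(mesh), _old(mesh.skip_partitioning()) {
    _mesh.skip_partitioning(skip);
  }
  ~ScopedSkipPartitioning() { _mesh.skip_partitioning(_old); }

  ScopedSkipPartitioning(const ScopedSkipPartitioning &) = delete;
  ScopedSkipPartitioning & operator=(const ScopedSkipPartitioning &) = delete;

private:
  Mesh & _mesh;
  bool _old;
};

// Hypothetical call site: application code that knows it needs a
// repartition re-enables it locally instead of relying on the library
// to ignore the flag.
template <typename Mesh>
void coarsen_with_repartition(Mesh & mesh) {
  ScopedSkipPartitioning<Mesh> guard(mesh, false);
  // ... refinement / coarsening would happen here ...
  mesh.partition();
}
```

With this arrangement the flag means exactly what it says everywhere else, and the exception is visible at the call site that needs it.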

friedmud · 2019-03-06T16:54:18Z

This is a one-line PR that has been here since July. This is important to us (I lost another day yesterday to it). There needs to be some resolution here.

Should we add actually_skip_partitioning() or really_completely_skip_partitioning()?

friedmud · 2019-03-06T16:56:05Z

Similarly we should deprecate skip_partitioning() in favor of possibly_skip_partitioning() so that people aren't confused....

vikramvgarg · 2019-03-06T17:16:07Z

Hi Derek,

We are using the Adjoint Refinement Error Estimator for a lot of our work and have referenced it as a capability of libMesh in publications. There has to be a better way to sort this out than effectively undoing a substantial amount of work that ties in very strongly with present and future work for Roy, Paul, and me at the very least.

Thanks.

Vikram Garg
vikramvgarg.github.io/

roystgnr · 2019-03-06T17:16:36Z

I agree this should be resolved; breaking others' code just isn't the way to do it. Neither is indefinitely disabling a major libMesh feature. So, counterproposal:

1: Send me something that replicates the problems. (something(s), assuming the GridPartitioner incompatibility and the mesh stitching inefficiency can't be

Extra credit: Let me know if #1965 has any effect on either for you. I'll check myself if you don't have time.

If the answer to 2 is "no" or "only on one" then I've got an idea for a somewhat hackish temporary fix, and if that works then we can get it in shortly while I figure out what the underlying problem is.

roystgnr · 2019-03-06T18:14:30Z

Ok, to recap that phone call:

Possible quick partial fix: limiting the PR here to just the serial case (which shouldn't break Vikram's or anybody's code but which should fix Derek's current use case). Derek's going to put a commit in here to try that.
There's a GridPartitioner example deck in the MOOSE test suite I can use to trigger the correctness bug (is it tests/partitioners/grid_partitioner/grid_partitioner.i? How do I get the bug to trigger?)
@permcody is setting up an iterative-mesh-stitching case I can use to trigger the performance case. If the serial-only quick fix isn't an adequate temporary workaround for that too then I can try my own hackish fix idea; otherwise the long-term fix is probably to move the call to the orphaned node hunting code out of partition() so we can more selectively limit when it gets invoked.

And damn it. @friedmud just buzzed me to point out that the serial check isn't going to work for everyone. Let me see what I can hack together instead.
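The long-term idea from the recap (moving the orphaned-node hunt out of partition() so it can be invoked selectively) can be illustrated with a toy model. Everything below is hypothetical: `ToyMesh`, `ToyElem`, `fix_orphaned_nodes`, and the "lowest touching rank wins" rule are assumptions made for the sketch, not libMesh's data structures or its actual node-assignment heuristic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Toy model only; none of these types are libMesh's API.
using ProcId = int;
constexpr ProcId invalid_proc = std::numeric_limits<ProcId>::max();

struct ToyElem {
  ProcId owner;
  std::vector<std::size_t> nodes; // indices into ToyMesh::node_owner
};

struct ToyMesh {
  std::vector<ProcId> node_owner;
  std::vector<ToyElem> elems;
  bool skip_partitioning = false;

  // Lowest rank among elements touching node n, or invalid_proc if none.
  ProcId lowest_touching_owner(std::size_t n) const {
    ProcId best = invalid_proc;
    for (const ToyElem & e : elems)
      if (std::find(e.nodes.begin(), e.nodes.end(), n) != e.nodes.end())
        best = std::min(best, e.owner);
    return best;
  }

  // A node's owner is valid if that rank owns some element touching it.
  bool owner_is_valid(std::size_t n) const {
    for (const ToyElem & e : elems)
      if (e.owner == node_owner[n] &&
          std::find(e.nodes.begin(), e.nodes.end(), n) != e.nodes.end())
        return true;
    return false;
  }

  // Full repartitioning; honors the skip flag with no exceptions.
  void partition(int n_procs) {
    if (skip_partitioning)
      return;
    for (std::size_t i = 0; i < elems.size(); ++i)
      elems[i].owner =
        static_cast<ProcId>(i % static_cast<std::size_t>(n_procs));
    for (std::size_t n = 0; n < node_owner.size(); ++n)
      node_owner[n] = lowest_touching_owner(n);
  }

  // Separate step, called explicitly (e.g. after coarsening): reassigns
  // only nodes whose current owner is no longer valid. Returns the
  // number of nodes reassigned.
  std::size_t fix_orphaned_nodes() {
    std::size_t fixed = 0;
    for (std::size_t n = 0; n < node_owner.size(); ++n) {
      if (owner_is_valid(n))
        continue;
      const ProcId p = lowest_touching_owner(n);
      if (p == invalid_proc)
        continue; // node touches no element; leave it alone
      node_owner[n] = p;
      ++fixed;
    }
    return fixed;
  }
};
```

In this toy version, mesh stitching with the skip flag set would never pay for the node hunt, while a coarsening routine could call `fix_orphaned_nodes()` itself and touch only the invalidated nodes.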

roystgnr · 2019-03-07T17:17:39Z

This got included into #2061

Skip partitioning in copy_nodes_and_elems if we should

2325c2e

roystgnr approved these changes Jul 27, 2018

friedmud mentioned this pull request Aug 6, 2018

Enforce that splitting meshes works with custom partitioners idaholab/moose#11953

Closed

roystgnr mentioned this pull request Feb 1, 2019

Re-implement parallel sync using the NBX algorithm #1965

Merged

friedmud mentioned this pull request Mar 6, 2019

Fixes for greatly speeding up mesh stitching #2059

Closed

roystgnr mentioned this pull request Mar 6, 2019

Skip partitioning in copy_nodes_and_elems if we should #2061

Merged

roystgnr closed this Mar 7, 2019
