Skip partitioning in copy_nodes_and_elems if we should #1815
|
This seems to have affected the adjoint QoI error estimation, but only in parallel. It's the same, and only, failure on both recipes. And only in parallel, serial is fine.
==========================================================
Estimating error
==========================================================
The flux QoI 0 is: 3.3294760746586462e+01
Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 158
final_local_dof_bearing_nodes = 154
Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 184
final_local_dof_bearing_nodes = 188
[2] /femputer/pbauman/civet_build_testing_root/libMesh/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Jul 27 2018 at 15:39:39
[3] /femputer/pbauman/civet_build_testing_root/libMesh/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Jul 27 2018 at 15:39:39
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 60351 RUNNING AT femputer.eng.buffalo.edu
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
FAIL regression/extra_quadrature_order_laplace_arefee_amr.sh (exit status: 1) |
|
Damn, I'm glad that got caught, but I don't see how this could be triggering it. Basically that assertion says "we uniformly refined and then coarsened again and somehow that changed our DoF numbering", but:
Sorry I don't have time to debug this. |
|
Well, wait, we're not turning off repartitioning, we're turning off correct_node_proc_ids, and that makes a difference. But still, we shouldn't even be coming near that code path in ARefEE. |
|
This is also causing problems with splitting meshes with certain partitioners... because without this fix there is no way to keep those partitioners from running during prepare_for_use (which will make them partition with the wrong number of partitions). Can we get to the bottom of this so we can merge this? This PR really does seem like it should be the correct behavior... and it feels like whatever is breaking must be depending on a side-effect. |
@roystgnr is out for a while and I'm swamped with other deadlines. But I'll at least try and see if I can get a more in-depth stack trace. adjoints_ex6 didn't seem to trigger this failure so I can at least try and see if I can cook up a libMesh-only example that triggers it. But I can't do much more than this for at least the next several weeks.
I agree. |
|
Oh wait, it only breaks in parallel and I didn't run make check in parallel on femputer, so adjoints_ex6 might still be broken if we run it with more than 1 processor. Let's check that first. |
|
Dammit. I can't get adjoints_ex6 to break standalone. I need to take a closer look at the GRINS failing case. |
|
OK, I can get adjoints_ex6 to break! Apply this patch:
[03:33:55][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/pbauman/examples/adjoints/adjoints_ex6][skip_part_in_copy_nodes_elems] $ git diff
diff --git a/examples/adjoints/adjoints_ex6/adjoints_ex6.C b/examples/adjoints/adjoints_ex6/adjoints_ex6.C
index ed662aa..4824f50 100644
--- a/examples/adjoints/adjoints_ex6/adjoints_ex6.C
+++ b/examples/adjoints/adjoints_ex6/adjoints_ex6.C
@@ -196,7 +196,7 @@ std::unique_ptr build_adjoint_refinement_error_estim
   adjoint_refinement_estimator->qoi_set() = qois;
 
   // We enrich the FE space for the dual problem by doing 2 uniform h refinements
-  adjoint_refinement_estimator->number_h_refinements = 2;
+  adjoint_refinement_estimator->number_h_refinements = 1;
 
   return std::unique_ptr(adjoint_refinement_estimator);
 }
diff --git a/examples/adjoints/adjoints_ex6/general.in b/examples/adjoints/adjoints_ex6/general.in
index 034b3be..aee9709 100644
--- a/examples/adjoints/adjoints_ex6/general.in
+++ b/examples/adjoints/adjoints_ex6/general.in
@@ -45,7 +45,7 @@ nelem_target = 40000
 global_tolerance = 0
 
 # Are we doing uniform refinement steps
-refine_uniformly = true
+refine_uniformly = false
 
 # Max number of refinements at each step
 refine_fraction = 0.05
@@ -54,10 +54,10 @@ refine_fraction = 0.05
 coarsen_fraction = 0.0
 
 # Coarsen threshold factor for refinement trading
-coarsen_threshold = 0
+coarsen_threshold = 0.1
 
 # The maximum number of adaptive steps per timestep
-max_adaptivesteps = 2
+max_adaptivesteps = 3
 
 # Use what finite element space?
 fe_family = LAGRANGE
[03:33:56][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/pbauman/examples/adjoints/adjoints_ex6][skip_part_in_copy_nodes_elems] $
Run with two processors and I get the following output:
[03:33:05][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/install/skip_part_in_copy_nodes_elems/examples/adjoints/ex6]$ mpiexec -np 2 ./example-devel
Started ./example-devel
Reading in and building the mesh
Building system
Initializing systems
Mesh Information:
elem_dimensions()={2}
spatial_dimension()=2
n_nodes()=81
n_local_nodes()=45
n_elem()=20
n_local_elem()=10
n_active_elem()=16
n_subdomains()=1
n_partitions()=2
n_processors()=2
n_threads()=1
processor_id()=0
EquationSystems
n_systems()=1
System #0, "PoissonSystem"
Type "Implicit"
Variables="T"
Finite Element Types="LAGRANGE", "JACOBI_20_00"
Infinite Element Mapping="CARTESIAN"
Approximation Orders="SECOND", "THIRD"
n_dofs()=81
n_local_dofs()=45
n_constrained_dofs()=32
n_local_constrained_dofs()=17
n_vectors()=2
n_matrices()=1
DofMap Sparsity
Average On-Processor Bandwidth <= 12.2222
Average Off-Processor Bandwidth <= 1.62963
Maximum On-Processor Bandwidth <= 25
Maximum Off-Processor Bandwidth <= 10
DofMap Constraints
Number of DoF Constraints = 32
Average DoF Constraint Length= 0
Number of Node Constraints = 0
Nonlinear solver converged, step 0, residual reduction 9.99412e-11 < 1e-09
Adaptive step 0, we have 16 active elements and 49 active dofs.
Postprocessing:
The computed QoI 0 is 36.230655305116564
The relative error in QoI 0 is 2.9365219718165605
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -36.230655307209972
The computed relative error in QoI 0 is 0.9275882192584336
The effectivity index for the computed error in QoI 0 is 0.31587988380846976
Refined mesh to 19 active elements and 57 active dofs.
Nonlinear solver converged, step 0, residual reduction 4.2366907426593718e-11 < 1.0000000000000001e-09
Adaptive step 1, we have 19 active elements and 57 active dofs.
Postprocessing:
The computed QoI 0 is 36.679825405894334
The relative error in QoI 0 is 3.3856920725943311
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -36.679825405637239
The computed relative error in QoI 0 is 0.86476474532195002
The effectivity index for the computed error in QoI 0 is 0.25541742331555645
Refined mesh to 22 active elements and 67 active dofs.
Nonlinear solver converged, step 0, residual reduction 5.6987413243126504e-11 < 1.0000000000000001e-09
Adaptive step 2, we have 22 active elements and 67 active dofs.
Postprocessing:
The computed QoI 0 is 37.000077436935747
The relative error in QoI 0 is 3.7059441036357441
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -37.008822648352393
The computed relative error in QoI 0 is 2.5286503300307319
The effectivity index for the computed error in QoI 0 is 0.68232284657234321
Refined mesh to 34 active elements and 107 active dofs.
Nonlinear solver converged, step 0, residual reduction 3.3895969456915461e-11 < 1.0000000000000001e-09
Adaptive step 3, we have 34 active elements and 107 active dofs.
Postprocessing:
The computed QoI 0 is 37.004887672501141
The relative error in QoI 0 is 3.710754339201138
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -36.986539654481803
Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 91
final_local_dof_bearing_nodes = 93
Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 76
final_local_dof_bearing_nodes = 74
[1] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug 3 2018 at 14:33:53
[0] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug 3 2018 at 14:33:53
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
|
|
Actually I don't even have to change adjoints_ex6.C. The following patch triggers failure on 2 processors:
[03:40:23][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/pbauman/examples/adjoints/adjoints_ex6][skip_part_in_copy_nodes_elems] $ git diff
diff --git a/examples/adjoints/adjoints_ex6/general.in b/examples/adjoints/adjoints_ex6/general.in
index 034b3be..85852e8 100644
--- a/examples/adjoints/adjoints_ex6/general.in
+++ b/examples/adjoints/adjoints_ex6/general.in
@@ -45,7 +45,7 @@ nelem_target = 40000
 global_tolerance = 0
 
 # Are we doing uniform refinement steps
-refine_uniformly = true
+refine_uniformly = false
 
 # Max number of refinements at each step
 refine_fraction = 0.05
@@ -54,10 +54,10 @@ refine_fraction = 0.05
 coarsen_fraction = 0.0
 
 # Coarsen threshold factor for refinement trading
-coarsen_threshold = 0
+coarsen_threshold = 0.1
 
 # The maximum number of adaptive steps per timestep
-max_adaptivesteps = 2
+max_adaptivesteps = 4
 
 # Use what finite element space?
 fe_family = LAGRANGE |
|
It's similar on the GRINS side. It takes several adaptive steps. It can happen with 1 or 2 h_adaptive_step in the error estimator. In the adjoints_ex6 case, with 2 h-adaptive steps in the error estimator, it took one more adaptive step to trigger the assert. |
|
For posterity, here's the output with the last patch I posted.
[03:40:10][pbauman@fry:/fry1/data/users/pbauman/research/libmesh/install/skip_part_in_copy_nodes_elems/examples/adjoints/ex6]$ mpiexec -np 2 ./example-devel
Started ./example-devel
Reading in and building the mesh
Building system
Initializing systems
Mesh Information:
elem_dimensions()={2}
spatial_dimension()=2
n_nodes()=81
n_local_nodes()=45
n_elem()=20
n_local_elem()=10
n_active_elem()=16
n_subdomains()=1
n_partitions()=2
n_processors()=2
n_threads()=1
processor_id()=0
EquationSystems
n_systems()=1
System #0, "PoissonSystem"
Type "Implicit"
Variables="T"
Finite Element Types="LAGRANGE", "JACOBI_20_00"
Infinite Element Mapping="CARTESIAN"
Approximation Orders="SECOND", "THIRD"
n_dofs()=81
n_local_dofs()=45
n_constrained_dofs()=32
n_local_constrained_dofs()=17
n_vectors()=2
n_matrices()=1
DofMap Sparsity
Average On-Processor Bandwidth <= 12.2222
Average Off-Processor Bandwidth <= 1.62963
Maximum On-Processor Bandwidth <= 25
Maximum Off-Processor Bandwidth <= 10
DofMap Constraints
Number of DoF Constraints = 32
Average DoF Constraint Length= 0
Number of Node Constraints = 0
Nonlinear solver converged, step 0, residual reduction 9.99412e-11 < 1e-09
Adaptive step 0, we have 16 active elements and 49 active dofs.
Postprocessing:
The computed QoI 0 is 36.230655305116564
The relative error in QoI 0 is 2.9365219718165605
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -36.230655307209972
The computed relative error in QoI 0 is 2.4147326937047575
The effectivity index for the computed error in QoI 0 is 0.8223104464670431
Refined mesh to 19 active elements and 57 active dofs.
Nonlinear solver converged, step 0, residual reduction 4.2366907426593718e-11 < 1.0000000000000001e-09
Adaptive step 1, we have 19 active elements and 57 active dofs.
Postprocessing:
The computed QoI 0 is 36.679825405894334
The relative error in QoI 0 is 3.3856920725943311
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -36.679825405637239
The computed relative error in QoI 0 is 3.0694618459445815
The effectivity index for the computed error in QoI 0 is 0.90659805443929986
Refined mesh to 22 active elements and 67 active dofs.
Nonlinear solver converged, step 0, residual reduction 5.6987413243126504e-11 < 1.0000000000000001e-09
Adaptive step 2, we have 22 active elements and 67 active dofs.
Postprocessing:
The computed QoI 0 is 37.000077436935747
The relative error in QoI 0 is 3.7059441036357441
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -37.008822648352393
The computed relative error in QoI 0 is 3.6688957314081927
The effectivity index for the computed error in QoI 0 is 0.99000298677165566
Refined mesh to 28 active elements and 85 active dofs.
Nonlinear solver converged, step 0, residual reduction 9.2604373170744638e-11 < 1.0000000000000001e-09
Adaptive step 3, we have 28 active elements and 85 active dofs.
Postprocessing:
The computed QoI 0 is 35.413841394123793
The relative error in QoI 0 is 2.1197080608237897
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -35.435965766674194
The computed relative error in QoI 0 is 2.0922047312006988
The effectivity index for the computed error in QoI 0 is 0.98702494455184453
Refined mesh to 34 active elements and 105 active dofs.
Nonlinear solver converged, step 0, residual reduction 7.1958162095113093e-11 < 1.0000000000000001e-09
Adaptive step 4, we have 34 active elements and 105 active dofs.
Postprocessing:
The computed QoI 0 is 34.979863391304818
The relative error in QoI 0 is 1.6857300580048147
Computing the error estimate using the Adjoint Refinement Error Estimator
The flux QoI 0 is: -34.975506469825348
Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 90
final_local_dof_bearing_nodes = 91
Assertion `local_dof_bearing_nodes == final_local_dof_bearing_nodes' failed.
local_dof_bearing_nodes = 79
final_local_dof_bearing_nodes = 78
[0] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug 3 2018 at 14:33:53
[1] ../../pbauman/src/error_estimation/adjoint_refinement_estimator.C, line 573, compiled Aug 3 2018 at 14:33:53
|
|
Thanks Paul - it's so weird... are those cases even explicitly turning off partitioning?
|
Agreed, this is really weird.
Certainly not on the GRINS side; I would need to read adjoints_ex6 more carefully to confirm, but I doubt it's being set in there either. We need to carefully go through |
|
So it looks like partitioning is getting disabled because the partitioner gets detached during the refinement + compute error estimate + coarsen step and then restored at the end. But it looks like we don't want repartitioning in there. |
|
I should say I'm intimately familiar with this theory, but @roystgnr and @vikramvgarg are the ones that wrote this code, so they'll be more familiar with implementation here. |
|
That commit 9b5553d is basically a complicated assertion - if you were to take it out then the ARefEE-using code would still break in every case, but the breakage would be slightly later and greatly more confusing. We definitely don't want repartitioning during ARefEE (or URefEE), 1% for efficiency's sake and 99% because we're caching an arbitrary number of coarse vectors that we hope to just straight-copy back into place in a system that has been restored to have the exact same partitioning and DoF number ordering that it started with.
I've got like 10 minutes more before family-vacation-duties trump you guys, so I'm afraid I don't have time to look into this deeply.
Paul, if you're jiggering refinement settings around, is there any chance you can get the same failure to trigger without this PR? There's some complicated code here and it's not impossible that it's currently broken in master but only triggering now.
But I don't think that's the case. I think what's happening is that when doing coarsening of a second-order-element AMR mesh we may need to correct node proc ids, due to one of those crazy cases on hanging nodes. Picture a coarse QUAD9 on proc 0 next to 4 refined QUAD9s on proc 1. There are 5 nodes on the shared coarse edge, and under our default node partitioning three of them belong to proc 0 (because they're part of the coarse quad) and two (the midedge nodes on the refined quads) belong to proc 1. Refine this, and now suddenly the two nodes from proc 1 canonically belong to proc 0, and we'll assign them accordingly. Coarsen this again, and if you want the node partitioning to match the original node partitioning again (as we demand here) then you have to correct_node_proc_ids, even if we haven't repartitioned any elements!
I'm not sure what the best way is to make sure that correction happens in such cases but does not happen during mesh stitching. |
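The QUAD9 hanging-node scenario described above can be sketched with a toy model in plain C++. This is only an illustration under stated assumptions, not libMesh code: `EdgeElem`, `canonical_owner`, `assign_owners`, `make_edge`, `refine`, and `coarsen` are hypothetical names, and the rule "a node belongs to the lowest rank among the active elements containing it" is a simplification assumed for the example. The five nodes on the shared edge are numbered 0 to 4, with 1 and 3 being the midedge nodes of the refined quads.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <map>
#include <vector>

// Toy stand-in for an element touching the shared edge (not a libMesh type).
struct EdgeElem
{
  int pid;                 // rank that owns the element
  std::vector<int> nodes;  // ids (0..4) of the edge nodes the element contains
  bool active;             // only active elements vote on node ownership
};

// ASSUMPTION: the canonical owner of a node is the lowest rank among the
// active elements that contain it.
int canonical_owner(int node, const std::vector<EdgeElem> & elems)
{
  int owner = std::numeric_limits<int>::max();
  for (const EdgeElem & e : elems)
    if (e.active &&
        std::find(e.nodes.begin(), e.nodes.end(), node) != e.nodes.end())
      owner = std::min(owner, e.pid);
  return owner;
}

// Owner of each of the five edge nodes under the rule above.
std::map<int, int> assign_owners(const std::vector<EdgeElem> & elems)
{
  std::map<int, int> owners;
  for (int node = 0; node < 5; ++node)
    owners[node] = canonical_owner(node, elems);
  return owners;
}

// One coarse QUAD9 on rank 0 next to refined QUAD9s on rank 1.
// Nodes 0, 2, 4 belong to the coarse quad; 1 and 3 are fine-side midedge nodes.
std::vector<EdgeElem> make_edge()
{
  return {EdgeElem{0, {0, 2, 4}, true},
          EdgeElem{1, {0, 1, 2}, true},
          EdgeElem{1, {2, 3, 4}, true}};
}

// Uniform refinement: the coarse quad goes inactive and its two rank-0
// children along the edge now contain nodes 1 and 3 as well.
void refine(std::vector<EdgeElem> & elems)
{
  assert(elems.size() == 3);
  elems[0].active = false;
  elems.push_back(EdgeElem{0, {0, 1, 2}, true});
  elems.push_back(EdgeElem{0, {2, 3, 4}, true});
}

// Coarsening: drop the children and reactivate the coarse quad.
void coarsen(std::vector<EdgeElem> & elems)
{
  assert(elems.size() == 5);
  elems.resize(3);
  elems[0].active = true;
}
```

In this model, calling `assign_owners` before `refine` gives nodes 1 and 3 to rank 1, while calling it after `refine` gives them to rank 0. After `coarsen`, the owners stamped during refinement are stale, and only recomputing them (the analogue of correcting node processor ids) restores the original assignment, which is the mismatch the `local_dof_bearing_nodes` assertion reports.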
|
Hmm... or perhaps the way to hit things is to go further, and make sure that those midpoint nodes are never reassigned to proc 0 in the first place? That's probably the best long-term fix even if it's not the easiest or quickest short-term fix. Anyway, AFK for a while. |
|
I don't think I'll find enough time to fix this soon but I can at least replicate it and I understand it a bit better.
It's not just a refinement-based-error-estimator problem; those are just the first code paths to catch an effect. In any mesh coarsening, there's the possibility that we may have a node whose processor id was previously consistent with the elements sharing it but now needs to change after the last element(s) sharing that id have vanished. And from there the trouble snowballs: on a distributed mesh, when a node needs to be changed for this reason, there may be any number of other processors on which that node is ghosted but on which not enough of the elements sharing it have been ghosted for us to determine the problem.
The current "send queries for all ghost nodes" code might be greatly improved upon, perhaps? If we somehow did "send queries for all ghost nodes which might be candidates for a remote_elem-coarsening-forced pid change, and compute locally on any others" that would be a ton less communication. (in particular it would mean no communication at all on serialized meshes)
The trouble is determining (in the distributed mesh case) which nodes are such candidates. Right now we even support meshes like domains that self-intersect at a single vertex with no neighbor links between the two sides sharing that vertex, but with no such links available the vertex owner doesn't necessarily even know that other processors have elements using it too! Even without crazy domains, supporting arbitrary ghosting patterns means that a node owner doesn't know what other processors have it ghosted.
As a first step, it would be possible to not send queries for any nodes that are definitely still consistent, when repartitioning is off. If I find any time to code I'll try that out. |
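The "first step" proposed above (skip queries for nodes that are definitely still consistent) might look roughly like the following filter. This is a hedged sketch in plain C++, not the libMesh implementation: `ToyNode`, `VisibleElem`, `still_consistent`, and `nodes_to_query` are hypothetical names, and "definitely still consistent" is assumed here to mean that some locally visible element containing the node has the same processor id as the node.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy stand-ins (not libMesh types): a node with its currently assigned
// rank, and an element this rank can see (local or ghosted).
struct ToyNode
{
  int id;
  int pid;
};

struct VisibleElem
{
  int pid;
  std::vector<int> nodes;  // ids of the nodes the element contains
};

// ASSUMPTION: a node's assignment is definitely still valid if at least one
// visible element containing it is owned by the same rank as the node.
bool still_consistent(const ToyNode & node,
                      const std::vector<VisibleElem> & elems)
{
  for (const VisibleElem & e : elems)
    if (e.pid == node.pid &&
        std::find(e.nodes.begin(), e.nodes.end(), node.id) != e.nodes.end())
      return true;
  return false;
}

// Ids of ghost nodes whose owner cannot be confirmed from local data and
// would therefore still need a query to another rank.
std::vector<int> nodes_to_query(int my_rank,
                                const std::vector<ToyNode> & nodes,
                                const std::vector<VisibleElem> & elems)
{
  assert(my_rank >= 0);
  std::vector<int> query;
  for (const ToyNode & node : nodes)
    {
      if (node.pid == my_rank)
        continue;  // locally owned, so this sketch sends no query for it
      if (!still_consistent(node, elems))
        query.push_back(node.id);
    }
  return query;
}
```

The filter only ever removes nodes whose current owner is confirmed by a visible element, so it cannot hide an orphaned node; whether it removes enough queries to matter on a distributed mesh depends on the ghosting pattern, as the comment above points out.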
|
@roystgnr any movement on this? It's holding up idaholab/moose#11953 |
|
Is there even something simple I can do for now? |
|
Maybe in the cases where you need this behavior we could just call |
|
Wait, this is a correctness issue, not just a performance optimization? I'll try and give it another look today. |
|
@roystgnr yeah, it is a correctness problem. The issue is that when doing n->m mesh splitting you don't want the mesh to be partitioned during reading/generation because it is going to be partitioned by the splitting process after that - and some partitioners only work on certain numbers of processors - like the GridPartitioner ( http://mooseframework.org/moose/source/partitioner/GridPartitioner.html ). Without this fix it doesn't truly turn off partitioning completely which generates errors with partitioners like that. |
|
Note to self so I don't waste time again later: none of the above breakage replicates with my default --enable-parmesh configuration; I have to be running with a default ReplicatedMesh to trigger it. (That isn't because DistributedMesh doesn't fall victim to the same problems, but if we want to trigger those problems with a different partitioning we're going to need different settings to get the exact same corner case to show up.) |
|
So... the underlying problem here isn't node repartitioning, it's the utter failure of our communications algorithms (both old and new) to do the right thing in n->m situations, when my assumption that a DofObject with processor_id m corresponds to an actual processor m is completely wrong. We're actually already carefully avoiding unnecessary node repartitioning in the skip_repartitioning code path, and have been since b5a39e0 in #1659, which limits exceptions to only nodes whose prior assignment has been invalidated by adaptive coarsening. But the mere act of trying to communicate which nodes have been so orphaned triggers the communication bugs, just as in #1950. Let me see if I can set up a test replicating the problem case(s) first, then I'll try to base a fix on top of #1965. |
|
This is still important - regardless of communication algorithm and n->m stuff. If we want to skip partitioning - then... it should be skipped! I'm back in here commenting about this because I noticed, once again, that copy_nodes_and_elements() is incredibly slow. This time because of mesh stitching (look for another PR soon to try to speed up mesh stitching by doing less prepare_for_use() during stitching... but this PR is also important). In my opinion - any code that relies on partitioning happening even though we've explicitly told libMesh not to partition... is broken code that should be removed until it can be refactored (or another workaround is found). When building large meshes (100M+ elements) with a lot of stitching this one thing can cost hours. |
You want libMesh to support nodes that aren't connected to any element owned by their owner? If we're going to refactor half the library then I'd rather get a benefit like "anisotropic refinement", not a benefit like "we no longer have to reassign inconsistently partitioned nodes when coarsening". |
|
+1 on "anisotropic refinement" ;-)
|
I see my comments at the SIAM CSE tutorial along the lines of that "being a huge amount of work" hasn't dissuaded @dschwen. |
|
@roystgnr I don't understand your statement - what does that have to do with turning off partitioning? If someone has turned off partitioning - then it should be off. Just because there is a workflow where if partitioning is turned off then bad stuff happens... that doesn't mean I shouldn't be able to turn it off and have that actually mean that it's off. I feel like we're increasingly having these types of discussions. What is the role of libMesh? Is it to hold your hand and make sure nothing bad can ever happen... or is it to be flexible and do what it has been prescribed to do. To me: if I tell a piece of software that I want to do something (like NOT partition)... and it decides to do it anyway... that's a BUG. What do we have I understand that this breaks one small use-case. But that small use-case should have never been dependent on this anyway... and a workaround should be made in that use-case. Not keep something fundamental (turning off partitioning) broken for everyone else. |
To skip partitioning of elements and validly partitioned nodes.
Calling parallel coarsening a "small use-case" is ridiculous.
I agree, but that's where we were. The alternatives then were "add a skip_partitioning() option that has a workaround" or "don't add a skip_partitioning() option".
So, something like, "We're actually already carefully avoiding unnecessary node repartitioning in the skip_repartitioning code path, and have been since b5a39e0 in #1659, which limits exceptions to only nodes whose prior assignment has been invalidated by adaptive coarsening." perhaps? The rest of that comment would probably be helpful for you too. |
The "small use-case" is: calling parallel coarsening with THAT's a small use-case... and is obviously just an invalid thing to do. But instead of making You're still dodging the main question: if I set That's the only question here. |
|
So: my proposal is this:
|
|
This is a one-line PR that has been here since July. This is important to us (I lost another day yesterday to it). There needs to be some resolution here. Should we add |
|
Similarly we should deprecate skip_partitioning() in favor of possibly_skip_partitioning() so that people aren't confused.... |
|
Hi Derek,
We are using the Adjoint Refinement Error Estimator for a lot of our work and have referenced it as a capability of libMesh in publications. There has to be a better way to sort this out than effectively undoing a substantial amount of work which ties in very strongly with present and future work for Roy, Paul, and me at the very least.
Thanks.
Vikram Garg
vikramvgarg.github.io/
|
|
I agree this should be resolved; breaking others' code just isn't the way to do it. Neither is indefinitely disabling a major libMesh feature. So, counterproposal: 1: Send me something that replicates the problems. (something(s), assuming the GridPartitioner incompatibility and the mesh stitching inefficiency can't be Extra credit: Let me know if #1965 has any effect on either for you. I'll check myself if you don't have time. If the answer to 2 is "no" or "only on one" then I've got an idea for a somewhat hackish temporary fix, and if that works then we can get it in shortly while I figure out what the underlying problem is. |
|
Ok, to recap that phone call:
And damn it. @friedmud just buzzed me to point out that the serial check isn't going to work for everyone. Let me see what I can hack together instead. |
|
This got included into #2061 |
Huge optimization for stitching meshes...
MeshTools::correct_node_proc_ids() was taking a HUGE amount of time... and is completely unnecessary in this situation.