hypoelastic examples cases fix #152

Merged
sbryngelson merged 1 commit into MFlowCode:master from wilfonba:hypoFix on May 24, 2023

Conversation

@wilfonba
Contributor

The hypoelastic example cases wouldn't run because case_dir was removed from the code.

@wilfonba wilfonba requested a review from sbryngelson as a code owner May 16, 2023 17:44
@wilfonba
Contributor Author

Not sure why this failed. Based on the case it failed on, it might have to do with #151? @lee-hyeoksu

@sbryngelson
Member

#151 involved creating new golden files. Maybe that's the problem you're experiencing, @wilfonba?

@wilfonba
Contributor Author

wilfonba commented May 16, 2023

It fails when doing the silo check for NaNs. The CI jobs that don't check for NaNs in the silo file all pass.

@hyeoksu-lee
Contributor

hyeoksu-lee commented May 17, 2023

I am trying to replicate the NaN on my local MFC build, but the NaN does not seem to appear in the test.

Could you let me know how Test Suite (ubuntu-latest, --no-debug, false) and Test Suite (ubuntu-latest, --no-debug, true, source /opt/intel/oneapi/setvars.sh) are different? Is it simply that the former does not check for NaNs? @wilfonba

@wilfonba
Contributor Author

I believe the (ubuntu-latest, --no-debug, true, source /opt/intel/oneapi/setvars.sh) job uses the Intel compilers. It must be compiler related since the GNU and NVHPC tests work fine. @anshgupta1234 added the Intel compilers before these cases were added, but it's still weird that they all passed when the PR was merged.

@hyeoksu-lee
Contributor

I see, @wilfonba. I don't know much about compiler-specific coding, but I believe #151 did not include anything that depends on a specific compiler. Is there any part where I have to consider compilers other than common/m_compile_specific.f90?

@wilfonba
Contributor Author

Yeah, I don't know much about what that file does. What's weird is that the cases were fine for your PR, and this PR did nothing but change two example case files, and it's suddenly not fine. It doesn't make much sense. Maybe @henryleberre has an idea if @anshgupta1234 is busy?

@hyeoksu-lee
Contributor

It is very rare, but compilers sometimes produce random errors. So, just to check, how about running the test again? @wilfonba

@wilfonba
Contributor Author

It's failed on the CI for my other pull request, on this case or a similar one, several times, but only with the Intel compilers.

@hyeoksu-lee
Contributor

I see, then that would not be the case.

@sbryngelson
Member

@lee-hyeoksu @wilfonba I just reran the CIs, but it looks like this is an actual problem and possibly has to do with Intel compilers. @lee-hyeoksu can you try loading the latest Intel compilers on Bridges2 or some other computer and then making sure it passes the tests?

@sbryngelson
Member

This is now passing @wilfonba. I am not sure what the issue was.

@wilfonba
Contributor Author

No idea. Maybe an issue with whatever hardware/software GitHub is using for CI.

@hyeoksu-lee
Contributor

I am currently setting up my access to Bridges2. I believe I will be able to test on it in a few days. The CI has now passed, but I will still test this on Bridges2 and let you know if there are issues.

@sbryngelson
Member

@lee-hyeoksu I re-ran the CI on the PR I merged from you on viscous + bubbles last week, and it actually failed at the Intel CPU test on your case: https://github.com/MFlowCode/MFC/actions/runs/5020414923 . I'm wondering if this is a sporadic bug that is reproducible on a non-GitHub machine with Intel compilers. Since it is only happening for your PR, I suspect it has something to do with the code you added there.

@hyeoksu-lee
Contributor

I tested on Bridges using the Intel 2021.3.0 compilers. I got a NaN error on tests 55533234 (2D -> bc=-1) and 6FC6A809 (3D -> bc=-1), as attached (the screenshot is for test 55533234). For 6FC6A809, the NaN also appeared at time step 50 of 51.

[Image: Screenshot 2023-05-21 at 12 51 30 PM]

So, I believe there is some incompatibility between MFC and the Intel compilers. When I compiled MFC, there were a lot of warnings for silo and hdf5. Could these warnings be related to this kind of error?

@wilfonba
Contributor Author

That error is thrown on step 51 when MFC tries to save a binary file (or whatever it is). This occurs before any silo or hdf5 routines are called, I believe, so this would have to be something wrong with MFC + Intel compilers. Do the serial output files in D/ have NaNs?
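
A check like the one asked about here can be scripted. The following is only a rough sketch: it assumes the serial output files in D/ are plain-text, whitespace-separated columns of numbers with a `.dat` suffix, which may not match the real layout. The helper name `find_nans` is hypothetical and not part of MFC.

```python
import math
from pathlib import Path


def find_nans(directory, pattern="*.dat"):
    """Return (file name, line number) pairs for every line holding a NaN.

    Assumes plain-text files of whitespace-separated numbers; the "*.dat"
    pattern is an assumption, not a documented MFC convention.
    """
    hits = []
    for path in sorted(Path(directory).glob(pattern)):
        with open(path) as handle:
            for lineno, line in enumerate(handle, start=1):
                for token in line.split():
                    try:
                        value = float(token)
                    except ValueError:
                        continue  # skip headers or other non-numeric text
                    if math.isnan(value):
                        hits.append((path.name, lineno))
                        break  # one report per line is enough
    return hits
```

Calling `find_nans("D")` from inside a case directory would return an empty list if no value parses as NaN. Tokens that Python cannot parse as floats (for example Fortran `D` exponents) are skipped rather than flagged, so an empty result is only as trustworthy as the format assumption.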

@sbryngelson
Member

Interesting. @lee-hyeoksu what happens if you use an openmpi-intel module? IntelMPI and OpenMPI are different, so this might help triangulate the problem.

@hyeoksu-lee
Contributor

@wilfonba I checked the serial outputs in D/, and they don't have NaNs. Also, @sbryngelson, I tried openmpi/4.0.2-intel20.4, but the NaNs still occur.

@hyeoksu-lee
Contributor

hyeoksu-lee commented May 22, 2023

I tried to figure out what happens in test 55533234 and found that the NaNs actually first occur at the first time step, when buffers are populated in s_populate_conservative_variables_buffers in m_rhs.fpp.

In this subroutine, the global parameters bc_s%beg and bc_s%end (s = x, y, or z) are used to populate the buffers in the s-direction according to the BCs. Inside the subroutine, however, these parameters are 0, although they should be negative integers. As a result, the subroutine does not assign appropriate values to q_cons_qp in the buffer regions, which leads to NaNs.

bc_x and bc_y become 0 right after returning from s_read_data_files(q_cons_ts(1)%vf) (https://github.com/MFlowCode/MFC/blob/master/src/simulation/p_main.fpp#L185).

I am still not sure why this happens with the Intel compilers, so I will keep looking into it, but I wanted to share what I have found so far. Any suggestions would be appreciated!
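
To illustrate the failure mode described above, here is a minimal 1D sketch. It is not MFC's s_populate_conservative_variables_buffers: the function name `populate_buffers`, the NaN placeholder for unfilled ghost cells, and the meaning given to the codes (-1 as periodic, -2 as reflective) are all assumptions made for illustration. The point is only that a BC code of 0 matches no branch, so the buffer cells are never assigned.

```python
import math

NAN = float("nan")


def populate_buffers(cells, buff_size, bc_beg, bc_end):
    """Pad a 1D field with ghost cells filled according to the BC codes.

    The codes are illustrative: -1 copies from the opposite end (periodic)
    and -2 mirrors the interior (reflective). Any other value, including 0,
    matches no branch, so the ghost cells keep their NaN placeholder.
    Requires buff_size <= len(cells).
    """
    n = len(cells)
    field = [NAN] * buff_size + list(cells) + [NAN] * buff_size
    for j in range(1, buff_size + 1):
        lo = buff_size - j          # j-th ghost cell left of the interior
        hi = buff_size + n - 1 + j  # j-th ghost cell right of the interior
        if bc_beg == -1:
            field[lo] = cells[n - j]
        elif bc_beg == -2:
            field[lo] = cells[j - 1]
        if bc_end == -1:
            field[hi] = cells[j - 1]
        elif bc_end == -2:
            field[hi] = cells[n - j]
    return field


good = populate_buffers([1.0, 2.0, 3.0, 4.0], 2, -1, -1)
bad = populate_buffers([1.0, 2.0, 3.0, 4.0], 2, 0, 0)
print(good)
print(any(math.isnan(v) for v in bad))
```

With the codes at -1 every ghost cell receives an interior value, while with the codes overwritten to 0 the ghost cells stay unassigned and any stencil that reads them propagates NaNs, consistent with the interior data in D/ looking clean at first.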

@sbryngelson
Member

I can't get any of the Intel compilers to work on Bridges2... any ideas anyone? @henryleberre @anshgupta1234

@hyeoksu-lee
Contributor

hyeoksu-lee commented May 23, 2023

I tried the Intel compiler with optimization level O0 (no optimization) by adding the line add_compile_options(-O0) to CMakeLists.txt (https://github.com/MFlowCode/MFC/blob/master/CMakeLists.txt#L107). With this change, all tests pass on Bridges2. The default optimization level for the Intel compiler is O2, so I think there is a problem with the Intel compiler's optimization.

Also, all tests pass with optimization level O1. @wilfonba Could you try this change on your PS?

@sbryngelson
Member

sbryngelson commented May 23, 2023

@lee-hyeoksu Thanks, this is helpful to know. I have moved discussion of this problem to an issue: #156 . I may create a separate issue if intelmpi modules create a different problem.

Update: intelmpi fails the same way as the above case does.

@sbryngelson sbryngelson merged commit 86823bf into MFlowCode:master May 24, 2023
@wilfonba wilfonba deleted the hypoFix branch August 3, 2023 04:32
3 participants