Investigate CI failures #56

Closed
inducer wants to merge 4 commits into master from is-numpy-1.20-busted

@inducer

inducer commented Feb 4, 2021

Owner Author

Huh. Works like a charm. I'll rerun to see if it's a fluke.

@alexfikl

alexfikl commented Feb 4, 2021

Collaborator

Seems to have failed again. Hm, so it's not numpy and not related to anything in #55.

@inducer inducer changed the title from "Test numpy < 1.20 from conda-forge" to "Investigate CI failures" Feb 4, 2021
@inducer

inducer commented Feb 4, 2021

Owner Author

Thanks @isuruf for checking for sumpy influence!

I just added a run without

cc @kaushikcfd

@kaushikcfd

Hmm. CI/examples passed when an older loopy is pinned. I don't have a fix yet, but:

  1. Locally, laplace-dirichlet-3d.py passes for me (with both loopy versions).
  2. The iname tags of the kernels generated by the two commits that show the discrepancy are identical.

@inducer

inducer commented Feb 4, 2021

Owner Author

So the first go passed. Now running a second round to see if it's repeatable.

@inducer

inducer commented Feb 5, 2021

Owner Author

For perspective, the examples failure is distinct. Full story at #57. This did not exhibit the failure I am concerned about here, which is the main Linux pytest failure. I'll rerun again, to see if it holds up.

@alexfikl

alexfikl commented Feb 6, 2021

Collaborator

Did the Linux tests fail after numpy was bumped down? I only remember the examples failing for a while now.

Those failures really looked like some sort of out of memory issue. Did https://gitlab.tiker.net/inducer/pytential/-/issues/131 ever improve?

@inducer

inducer commented Feb 8, 2021

Owner Author

The only way I know to curb this misery going forward is running downstream CI along with upstream projects, in this case loopy, as I propose here: inducer/loopy#220. If that works out, I'll probably apply the same idea to meshmode (inducer/meshmode#113).

@inducer

inducer commented Feb 8, 2021

Owner Author

> Did the Linux tests fail after numpy was bumped down?

I don't recall such an instance. I'll bump numpy back up here, to check. But I don't expect it to fail.

> Those failures really looked like some sort of out of memory issue.

I agree, though I don't see (yet?) how the loopy PRs would inflate memory usage in a substantial fashion.

> Did https://gitlab.tiker.net/inducer/pytential/-/issues/131 ever improve?

No, it didn't. It just wasn't bad enough to be a problem. In addition, there's a similar-looking mystery (illinois-ceesd/mirgecom#212) being chased down in mirgecom.

@alexfikl

alexfikl commented Feb 8, 2021

Collaborator

> I don't recall such an instance. I'll bump numpy back up here, to check. But I don't expect it to fail.

It seems to have failed, rerun?

> No, didn't. It just wasn't bad enough to be a problem. In addition, there's a similar-looking mystery (illinois-ceesd/mirgecom#212) being chased down in mirgecom.

Maybe worth adding a memory pool already in pytest_generate_tests_for_pyopencl_array_context? Although yeah, that would just hide the issue.

@inducer

inducer commented Feb 8, 2021

Owner Author

> It seems to have failed, rerun?

Wha? 🤯

Sure, I'll rerun, but now I don't know what to believe. Is this something that's brought about by numpy or the loopy or both?

@inducer

inducer commented Feb 8, 2021

Owner Author

> Maybe worth adding a memory pool already in pytest_generate_tests_for_pyopencl_array_context? Although yeah, that would just hide the issue.

Ugh, no. Not a fan of sweeping stuff under the rug.

@alexfikl

alexfikl commented Feb 8, 2021

Collaborator

> Sure, I'll rerun, but now I don't know what to believe. Is this something that's brought about by numpy and the loopy change together?

Just to add another variable: looking at the CI history, last scheduled run on Ubuntu 18.04 passed just fine, but then the next ones on Ubuntu 20.04 started failing. Can we pin it to Ubuntu 18.04 to see if that passes reliably?

Besides that, no idea, since it seems to pass intermittently.

@inducer

inducer commented Feb 8, 2021

Owner Author

Hmm, so possibly the common theme among all these changes (newer ubuntu, loopy PRs, numpy 1.20) is just that they each ever so slightly increase memory usage...

@inducer

inducer commented Feb 8, 2021

Owner Author

Passed this time around, FWIW.

@inducer

inducer commented Feb 9, 2021

Owner Author

Alright, I'm now super confused. Reverting the Loopy PRs that we suspected caused problems actually did exactly nothing to help inducer/loopy#220 pass. So that theory is pretty dead in the water to me.

@inducer

inducer commented Feb 9, 2021

Owner Author

My next best plan is to go hunt this stupid memory leak. Grrr.
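
A minimal sketch of how such a hunt might start, using only the standard-library `tracemalloc` module: run a workload repeatedly and record traced memory after each round, so that steady growth stands out. The function names (`measure_growth`, `leaky_workload`, `clean_workload`) are hypothetical and not part of pytential. Note that `tracemalloc` only sees allocations made through Python's allocators, so growth inside native libraries (for example an OpenCL runtime) would not show up here.

```python
import tracemalloc


def measure_growth(workload, rounds=5):
    """Return Python-level traced memory (in bytes) after each round of `workload`."""
    tracemalloc.start()
    try:
        sizes = []
        for _ in range(rounds):
            workload()
            current, _peak = tracemalloc.get_traced_memory()
            sizes.append(current)
        return sizes
    finally:
        tracemalloc.stop()


# Hypothetical workloads for illustration only.
_retained = []


def leaky_workload():
    # Keeps a reference alive, so memory grows by about 100 kB per round.
    _retained.append(bytearray(100_000))


def clean_workload():
    # The buffer is released as soon as the call returns.
    bytearray(100_000)


print("leaky:", measure_growth(leaky_workload))
print("clean:", measure_growth(clean_workload))
```

If the per-round numbers keep climbing for a test that should be stateless, that test is a candidate for closer inspection; flat numbers here do not rule out a leak in native code.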

@inducer

inducer commented Feb 9, 2021

Owner Author

illinois-ceesd/mirgecom#212 if you'd like to follow the saga.

@inducer

inducer commented Feb 10, 2021

Owner Author

Using jemalloc for the CI (#58) seems to help. See illinois-ceesd/mirgecom#212 for more details. Closing here.
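
For readers unfamiliar with swapping allocators: one common way to use jemalloc on Linux without rebuilding anything is to preload its shared library via `LD_PRELOAD` when launching the test process. The sketch below only builds such an environment; it is an assumption about the general technique, not a description of what #58 actually does. The helper name `env_with_preload` and the library path are hypothetical.

```python
import os


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with `lib_path` prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic loader accepts a colon-separated list of libraries.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


# Hypothetical library location; the real path depends on the system.
env = env_with_preload("/usr/lib/libjemalloc.so.2", base_env={})
print(env["LD_PRELOAD"])

# To launch a test run under this environment, one could then call e.g.:
#   subprocess.run([sys.executable, "-m", "pytest"], env=env_with_preload(path))
```

Keeping the change in the launch environment rather than in the code makes it easy to compare runs with and without the alternative allocator.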

@inducer inducer closed this Feb 10, 2021
@inducer inducer deleted the is-numpy-1.20-busted branch February 10, 2021 18:53

4 participants