
Refactor test runner to support more pinning options#535

Merged
bryevdv merged 2 commits into nv-legate:branch-22.10 from bryevdv:bv/test_pinning_options
Aug 17, 2022

Conversation

@bryevdv
Contributor

@bryevdv bryevdv commented Aug 16, 2022

This PR updates the test runner test.py in the following ways:

  • Restores setting {"REALM_SYNTHETIC_CORE_MAP": ""} by default
  • Adds --strict-pin option to not set REALM_SYNTHETIC_CORE_MAP and also account for a Python processor in the worker algorithm (in addition to the specified utility processors)
  • Combines hyper-cores for purposes of passing to --cpu-bind (i.e. for 2-way HT --cpu-bind will always receive N pairs of ids, where each pair is two HT sibling cores)
  • Changes --fbmem units to MB
  • Adds some missing tests for the test runner itself (not currently run in CI)
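The worker accounting under --strict-pin can be sketched roughly as follows. This is a hedged illustration only, assuming each worker must pin its CPU/OMP processors plus the utility processors plus one Python processor to real cores; the function name and signature are invented here and are not the actual test.py API.

```python
# Hypothetical sketch of strict-pin worker accounting; names are
# illustrative, not the real test.py implementation.

def compute_workers_strict(total_cpus: int, procs_per_worker: int,
                           utility: int) -> int:
    """How many test workers fit when, under --strict-pin, each worker
    must also pin the specified utility processors plus one Python
    processor to real cores."""
    cores_per_worker = procs_per_worker + utility + 1
    return total_cpus // cores_per_worker
```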

In "strict" mode, there should not be any Realm reservation warnings such as

[0 - 7efd582f7000]    0.000052 {4}{threads}: reservation ('CPU proc 1d00000000000003') cannot be satisfied 

cc @manopapad @magnatelee This current PR changes the algorithm to unconditionally "combine" hyper-cores before passing to --cpu-bind. Do we want to expose an option to not do that as well?

@bryevdv
Contributor Author

bryevdv commented Aug 16, 2022

@magnatelee you also mentioned wanting to reduce GPU oversubscription; that would be easy to roll into this PR if you can describe the changes, e.g. increasing the default fbmem value to more than 4GB. Alternatively we could increase the bloat factor, but note that currently the bloat factor does not affect the value passed to --fbmem for the individual test (or should it? Currently it just feeds into adjusting the number of workers.)

@magnatelee
Contributor

magnatelee commented Aug 16, 2022

This current PR changes the algorithm to unconditionally "combine" hyper-cores before passing to --cpu-bind. Do we want to expose an option to not do that as well?

That option is useful to have I guess, but I've never found it helpful to split hyperthreads across different processes.

you also mentioned wanting to reduce GPU oversubscription; that would be easy to roll into this PR if you can describe the changes, e.g. increasing the default fbmem value to more than 4GB. Alternatively we could increase the bloat factor, but note that currently the bloat factor does not affect the value passed to --fbmem for the individual test (or should it? Currently it just feeds into adjusting the number of workers.)

I suspect the oversubscription factor isn't causing the intermittent failure, but back-to-back execution is. FYI, I'm seeing the failure, though less frequently, even with the reduced oversubscription. So, it might be simpler to just add delay between tests within the same shard. Even better is to detect the case that the test fails due to starting the test prematurely and retry that test.

@bryevdv
Contributor Author

bryevdv commented Aug 16, 2022

That option is useful to have I guess, but I've never found it helpful to split hyperthreads across different processes.

I might have explained poorly, because this sounds like the opposite of what is happening. In more detail, assume there are hyper-core sibling pairs:

(0, 4)  (1, 5)  (2, 6)  (3, 7)

The change in the PR means:

  • There are 4 (not 8) CPUs for purposes of test running, and
  • if the "first" and "second" CPUs are in a shard, it is --cpu-bind 0,4,1,5
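The pairing above can be sketched as follows. This is an illustrative snippet, not the actual test.py code: each physical core's hyperthread siblings stay together, and a shard's --cpu-bind argument lists both siblings of every core the shard owns.

```python
# Illustrative sketch of combining 2-way HT siblings for --cpu-bind;
# not the real test.py implementation.

def cpu_bind_arg(shard):
    # shard: list of sibling tuples, e.g. [(0, 4), (1, 5)], where each
    # tuple is one physical core's two hyperthread ids
    return ",".join(str(cpu) for pair in shard for cpu in pair)

pairs = [(0, 4), (1, 5), (2, 6), (3, 7)]  # 4 "CPUs" for test running
# A shard holding the "first" and "second" CPUs binds both siblings of each:
print(cpu_bind_arg(pairs[:2]))  # -> 0,4,1,5
```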

Even better is to detect the case that the test fails due to starting the test prematurely and retry that test.

I don't think there is any way to know that a test failed for any particular reason, only that it failed. If you think a generic "retry N times" option is valuable, I could add that in a dedicated feature PR.
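A generic "retry N times" option, as floated above, could look roughly like this. A minimal sketch only: `run_test` is assumed to be any callable returning True on pass, which is an invented interface, not the real test.py API.

```python
# Minimal sketch of a generic retry wrapper; `run_test` is a
# hypothetical callable returning True when the test passes.

def run_with_retries(run_test, retries=0):
    """Run a test up to 1 + retries times; the first pass wins."""
    for _ in range(retries + 1):
        if run_test():
            return True
    return False
```

Note this deliberately retries on any failure, since (per the discussion above) there is no way to know that a test failed for one particular reason.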

@manopapad
Contributor

The change in the PR means:

  • There are 4 (not 8) CPUs for purposes of test running, and
  • if the "first" and "second" CPUs are in a shard, it is --cpu-bind 0,4,1,5

I agree with this behavior. The alternative would be to allow splitting a physical core's two virtual cores between two different workers, which Wonchan and I seem to agree is not that useful.

I also agree that we should allocate a full physical core for each CPU/OMP processor, even when strict pinning is turned off.

@bryevdv
Contributor Author

bryevdv commented Aug 17, 2022

So, it might be simpler to just add delay between tests within the same shard.

@magnatelee I would make this configurable, but how big of a delay should be used by default? Should this only apply to gpu tests?

@magnatelee
Contributor

magnatelee commented Aug 17, 2022

@magnatelee I would make this configurable, but how big of a delay should be used by default? Should this only apply to gpu tests?

Let's put 2 seconds between tests by default. (There's no reasoning behind this number though.) And yes, this is necessary only for GPU tests. I think in a follow-up PR we should do this in a proper way by polling the GPU state to wait until the previous test's processes relinquish the GPUs.
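The proposed delay could be sketched like this. Hedged illustration only: the function name, the shard interface, and wiring of the 2-second default are assumptions for this sketch, not the actual test.py implementation.

```python
# Hypothetical sketch of delaying between GPU tests within one shard;
# illustrative names, not the real test.py code.
import time

def run_gpu_shard(tests, run_test, gpu_delay=2.0):
    """Run each test in a shard, sleeping gpu_delay seconds between
    tests so the previous test's processes can release the GPUs."""
    results = []
    for i, test in enumerate(tests):
        if i > 0:
            time.sleep(gpu_delay)
        results.append(run_test(test))
    return results
```

The follow-up idea above (polling the GPU state instead of a fixed sleep) would replace the `time.sleep` call with a wait loop.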

@bryevdv
Contributor Author

bryevdv commented Aug 17, 2022

@magnatelee please check out cbfb0c7

@bryevdv bryevdv merged commit 86b7a62 into nv-legate:branch-22.10 Aug 17, 2022
@bryevdv bryevdv deleted the bv/test_pinning_options branch August 17, 2022 20:59
jjwilke pushed a commit to jjwilke/cunumeric that referenced this pull request Sep 2, 2022
* Refactor test runner to support more pinning options

* add --gpu-delay option