
Refactor test runner to support more pinning options#535

Merged
bryevdv merged 2 commits into nv-legate:branch-22.10 from bryevdv:bv/test_pinning_options
Aug 17, 2022

Conversation

@bryevdv
Contributor

@bryevdv bryevdv commented Aug 16, 2022

This PR updates the test runner test.py in the following ways:

  • Restores setting {"REALM_SYNTHETIC_CORE_MAP": ""} by default
  • Adds --strict-pin option to not set REALM_SYNTHETIC_CORE_MAP and also account for a Python processor in the worker algorithm (in addition to the specified utility processors)
  • Combines hyper-cores for purposes of passing to --cpu-bind (i.e. for 2-way HT --cpu-bind will always receive N pairs of ids, where each pair is two HT sibling cores)
  • Changes --fbmem units to MB
  • Adds some missing tests for the test runner itself (not currently run in CI)
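The worker accounting under --strict-pin can be sketched roughly as follows. This is a hedged illustration only, assuming each worker must pin its CPU/OMP processors plus the utility processors plus one Python processor to real cores; the function name and signature are invented here and are not the actual test.py API.

```python
# Hypothetical sketch of strict-pin worker accounting; names are
# illustrative, not the real test.py implementation.

def compute_workers_strict(total_cpus: int, procs_per_worker: int,
                           utility: int) -> int:
    """How many test workers fit when, under --strict-pin, each worker
    must also pin the specified utility processors plus one Python
    processor to real cores."""
    cores_per_worker = procs_per_worker + utility + 1
    return total_cpus // cores_per_worker
```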

In "strict" mode, there should not be any Realm reservation warnings such as

[0 - 7efd582f7000]    0.000052 {4}{threads}: reservation ('CPU proc 1d00000000000003') cannot be satisfied 

cc @manopapad @magnatelee This current PR changes the algorithm to unconditionally "combine" hyper-cores before passing to --cpu-bind. Do we want to expose an option to not do that as well?

@bryevdv
Contributor Author

bryevdv commented Aug 16, 2022

@magnatelee you also mentioned wanting to reduce GPU oversubscription; that would be easy to roll into this PR if you can describe the changes, e.g. increasing the default fbmem value to more than 4GB. Alternatively we could increase the bloat factor, but note that currently the bloat factor does not affect the value passed to --fbmem for the individual test (or should it? Currently it just feeds into adjusting the number of workers.)

@magnatelee
Contributor

magnatelee commented Aug 16, 2022

This current PR changes the algorithm to unconditionally "combine" hyper-cores before passing to --cpu-bind. Do we want to expose an option to not do that as well?

That option is useful to have I guess, but I've never found it helpful to split hyperthreads across different processes.

you also mentioned wanting to reduce GPU oversubscription; that would be easy to roll into this PR if you can describe the changes, e.g. increasing the default fbmem value to more than 4GB. Alternatively we could increase the bloat factor, but note that currently the bloat factor does not affect the value passed to --fbmem for the individual test (or should it? Currently it just feeds into adjusting the number of workers.)

I suspect the oversubscription factor isn't causing the intermittent failure, but back-to-back execution is. FYI, I'm seeing the failure, though less frequently, even with the reduced oversubscription. So, it might be simpler to just add delay between tests within the same shard. Even better is to detect the case that the test fails due to starting the test prematurely and retry that test.

@bryevdv
Contributor Author

bryevdv commented Aug 16, 2022

That option is useful to have I guess, but I've never found it helpful to split hyperthreads across different processes.

I might have explained poorly, because this sounds like the opposite of what is happening. In more detail, assume there are hyper-core sibling pairs:

(0, 4)  (1, 5)  (2, 6)  (3, 7)

The change in the PR means:

  • There are 4 (not 8) CPUs for purposes of test running, and
  • if the "first" and "second" CPUs are in a shard, it is --cpu-bind 0,4,1,5
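The pairing above can be sketched as follows. This is an illustrative snippet, not the actual test.py code: each physical core's hyperthread siblings stay together, and a shard's --cpu-bind argument lists both siblings of every core the shard owns.

```python
# Illustrative sketch of combining 2-way HT siblings for --cpu-bind;
# not the real test.py implementation.

def cpu_bind_arg(shard):
    # shard: list of sibling tuples, e.g. [(0, 4), (1, 5)], where each
    # tuple is one physical core's two hyperthread ids
    return ",".join(str(cpu) for pair in shard for cpu in pair)

pairs = [(0, 4), (1, 5), (2, 6), (3, 7)]  # 4 "CPUs" for test running
# A shard holding the "first" and "second" CPUs binds both siblings of each:
print(cpu_bind_arg(pairs[:2]))  # -> 0,4,1,5
```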

Even better is to detect the case that the test fails due to starting the test prematurely and retry that test.

I don't think there is any way to know that a test failed for any particular reason, only that it failed. If you think a generic "retry N times" option is valuable, I could add that in a dedicated feature PR.
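A generic "retry N times" option, as floated above, could look roughly like this. A minimal sketch only: `run_test` is assumed to be any callable returning True on pass, which is an invented interface, not the real test.py API.

```python
# Minimal sketch of a generic retry wrapper; `run_test` is a
# hypothetical callable returning True when the test passes.

def run_with_retries(run_test, retries=0):
    """Run a test up to 1 + retries times; the first pass wins."""
    for _ in range(retries + 1):
        if run_test():
            return True
    return False
```

Note this deliberately retries on any failure, since (per the discussion above) there is no way to know that a test failed for one particular reason.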

@manopapad
Contributor

The change in the PR means:

  • There are 4 (not 8) CPUs for purposes of test running, and
  • if the "first" and "second" CPUs are in a shard, it is --cpu-bind 0,4,1,5

I agree with this behavior. The alternative would be to allow splitting a physical core's two virtual cores between two different workers, which Wonchan and I seem to agree is not that useful.

I also agree that we should allocate a full physical core for each CPU/OMP processor, even when strict pinning is turned off.

@bryevdv
Contributor Author

bryevdv commented Aug 17, 2022

So, it might be simpler to just add delay between tests within the same shard.

@magnatelee I would make this configurable, but how big of a delay should be used by default? Should this only apply to gpu tests?

@magnatelee
Contributor

magnatelee commented Aug 17, 2022

@magnatelee I would make this configurable, but how big of a delay should be used by default? Should this only apply to gpu tests?

Let's put 2 seconds between tests by default. (There's no reasoning behind this number though.) And yes, this is necessary only for GPU tests. I think in a follow-up PR we should do this in a proper way by polling the GPU state to wait until the previous test's processes relinquish the GPUs.
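The proposed delay could be sketched like this. Hedged illustration only: the function name, the shard interface, and wiring of the 2-second default are assumptions for this sketch, not the actual test.py implementation.

```python
# Hypothetical sketch of delaying between GPU tests within one shard;
# illustrative names, not the real test.py code.
import time

def run_gpu_shard(tests, run_test, gpu_delay=2.0):
    """Run each test in a shard, sleeping gpu_delay seconds between
    tests so the previous test's processes can release the GPUs."""
    results = []
    for i, test in enumerate(tests):
        if i > 0:
            time.sleep(gpu_delay)
        results.append(run_test(test))
    return results
```

The follow-up idea above (polling the GPU state instead of a fixed sleep) would replace the `time.sleep` call with a wait loop.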

@bryevdv
Contributor Author

bryevdv commented Aug 17, 2022

@magnatelee please check out cbfb0c7

@bryevdv bryevdv merged commit 86b7a62 into nv-legate:branch-22.10 Aug 17, 2022
@bryevdv bryevdv deleted the bv/test_pinning_options branch August 17, 2022 20:59
jjwilke pushed a commit to jjwilke/cunumeric that referenced this pull request Sep 2, 2022
* Refactor test runner to support more pinning options

* add --gpu-delay option