Refactor test runner to support more pinning options #535
bryevdv merged 2 commits into nv-legate:branch-22.10 from bryevdv:bv/test_pinning_options
@magnatelee you also mentioned wanting to reduce GPU oversubscription. That would be easy to roll into this PR if you can describe the changes, e.g. increasing the default
That option is useful to have I guess, but I've never found it helpful to split hyperthreads across different processes.
I suspect the oversubscription factor isn't causing the intermittent failure, but back-to-back execution is. FYI, I'm still seeing the failure, though less frequently, even with the reduced oversubscription. So it might be simpler to just add a delay between tests within the same shard. Even better would be to detect the case where a test fails because it started prematurely and retry that test.
I might have explained poorly, because this sounds like the opposite of what is happening. In more detail, assume there are hyper-core sibling pairs: The change in the PR means:
I don't think there is any way to know that a test failed for any particular reason, only that it failed. If you think a generic "retry N times" option is valuable, I could add that in a dedicated feature PR.
I agree with this behavior. The alternative would be to allow splitting a physical core's two virtual cores between two different workers, which Wonchan and I seem to agree is not that useful. I also agree that we should allocate a full physical core for each CPU/OMP processor, even when strict pinning is turned off.
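To make the behavior discussed above concrete, here is a minimal sketch of keeping hyperthread siblings together when splitting cores among workers. It is not the code from test.py: the function name shard_cores, the cores_per_worker parameter, and the example core numbering are all assumptions made for illustration.

```python
from typing import List, Tuple


def shard_cores(
    sibling_pairs: List[Tuple[int, ...]], cores_per_worker: int
) -> List[Tuple[int, ...]]:
    """Split physical cores into per-worker shards, keeping HT siblings together.

    Each entry of sibling_pairs holds the logical CPU ids of one physical core.
    Physical cores that do not fill a complete shard are left unused.
    """
    shards = []
    usable = len(sibling_pairs) - len(sibling_pairs) % cores_per_worker
    for start in range(0, usable, cores_per_worker):
        group = sibling_pairs[start : start + cores_per_worker]
        # Flatten the group so every sibling of a physical core lands in the same shard.
        shards.append(tuple(cpu for pair in group for cpu in pair))
    return shards


# Hypothetical 4-core, 2-way HT machine where logical CPU i is paired with i + 4.
pairs = [(0, 4), (1, 5), (2, 6), (3, 7)]
shards = shard_cores(pairs, cores_per_worker=2)

# One comma-separated id list per worker, in the shape a CPU-binding option could take.
cpu_bind_args = [",".join(str(cpu) for cpu in shard) for shard in shards]
```

With this layout no worker ever shares a physical core with another worker, which is the property both reviewers say they want.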
@magnatelee I would make this configurable, but how big of a delay should be used by default? Should this only apply to GPU tests?
Let's put 2 seconds between tests by default. (There's no reasoning behind this number though.) And yes, this is necessary only for GPU tests. I think in a follow-up PR we should do this in a proper way by polling the GPU state to wait until the previous test processes relinquish the GPUs.
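The delay proposed above could look roughly like the following sketch, which runs the tests of one shard back to back and pauses between consecutive tests. This is only an illustration under stated assumptions: run_shard and its parameters are hypothetical names, not the test runner's actual API, and the two "tests" are trivial Python one-liners.

```python
import subprocess
import sys
import time
from typing import Callable, List, Sequence


def run_shard(
    commands: Sequence[List[str]],
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Run test commands sequentially, pausing `delay` seconds between them."""
    returncodes = []
    for index, cmd in enumerate(commands):
        if index > 0:
            # Give the previous test process time to release its GPUs.
            sleep(delay)
        proc = subprocess.run(cmd, capture_output=True)
        returncodes.append(proc.returncode)
    return returncodes


# Demo: a recording stub replaces time.sleep so the example does not actually wait.
pauses: List[float] = []
codes = run_shard(
    [
        [sys.executable, "-c", "pass"],
        [sys.executable, "-c", "raise SystemExit(3)"],
    ],
    delay=2.0,
    sleep=pauses.append,
)
```

A fixed pause is the simple option chosen in this thread; polling the GPU state, as suggested for a follow-up, would replace the sleep call with a wait on actual device availability.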
@magnatelee please check out cbfb0c7
* Refactor test runner to support more pinning options
* add --gpu-delay option
This PR updates the test runner test.py in the following ways:

* Set {"REALM_SYNTHETIC_CORE_MAP": ""} by default
* Add a --strict-pin option to not set REALM_SYNTHETIC_CORE_MAP and also account for a Python processor in the worker algorithm (in addition to specified utility)
* Always pass complete sets of HT sibling cores to --cpu-bind (i.e. for 2-way HT, --cpu-bind will always receive N pairs of ids, where each pair is two HT sibling cores)
* Change --fbmem units to MB

In "strict" mode, there should not be any Realm reservation warnings.
cc @manopapad @magnatelee
This current PR changes the algorithm to unconditionally "combine" hyper-cores before passing to --cpu-bind. Do we want to expose an option to not do that as well?