Skip to content

Collate test output to allow workers > 1 with verbose output#507

Merged
bryevdv merged 8 commits intonv-legate:branch-22.10from
bryevdv:bv/test_collate
Aug 5, 2022
Merged

Collate test output to allow workers > 1 with verbose output#507
bryevdv merged 8 commits intonv-legate:branch-22.10from
bryevdv:bv/test_collate

Conversation

@bryevdv
Copy link
Contributor

@bryevdv bryevdv commented Aug 5, 2022

Still WIP but I want to see how CI performs.

@bryevdv bryevdv changed the title [WIP] Collate test output to allow workers [WIP] Collate test output to allow workers > 1 with verbose output Aug 5, 2022
@bryevdv bryevdv requested a review from magnatelee August 5, 2022 17:46
@bryevdv
Copy link
Contributor Author

bryevdv commented Aug 5, 2022

@magnatelee This generally seems better (or at least comparable) to old CI runs in most cases. The exception is the GPU tests, which still seem to be slower. The CI machines have 2 GPUs and the new algo still only decides to use 1 worker for both tests. It seems like that's incorrect, unless the GPUs have very low fbsize?

Edit: I guess so? nvidia-smi report s 16160MiB in the test, and so with current fbmem and bloat factor defaults:

int(fbsize // (config.fbmem * BLOAT_FACTOR)) # ~16 / (6 * 1.5) = 1 

I'm not sure why the GPU jobs would have run faster in the old arrangement tho (since they also had only 1 worker)

@bryevdv
Copy link
Contributor Author

bryevdv commented Aug 5, 2022

Well, this was incorrect:

since they also had only 1 worker

The old test runs use 2 workers

### Entering stage: GPU (with 2 workers)

which I think accounts for the ~2x slow-down for GPU tests.

@magnatelee should we adjust the worker spec computation in the GPU stage?

@bryevdv
Copy link
Contributor Author

bryevdv commented Aug 5, 2022

The previous workers computation was more or less the current one, except without the bloat factor. Reducing it from 1.5 to 1.25 just to see how CI performs. @magnatelee should we make the bloat factor confugurable?

@magnatelee
Copy link
Contributor

magnatelee commented Aug 5, 2022

Ok. I think some clarification would help here. First of all, the legate launcher sets the framebuffer size to 4GB by default. Second, the 6GB budget in the test script already takes the bloating factor into account. That means DEFAULT_GPU_MEMORY_BUDGET used in the current test script is not the framebuffer size the launcher uses but the bloated one. I think two changes are needed:

  1. Change DEFAULT_GPU_MEMORY_BUDGET to 4GB
  2. Pass config.fbmem to to the launcher when running GPU tests (i.e., add ["--fbmem", config.fbmem] to the launcher args for GPU tests)

Then, you can keep the bloating factor being 1.5 and everything works as it did before. I'm fine with making the bloating factor configurable as well, but I don't expect developers to configure it by themselves.

@bryevdv
Copy link
Contributor Author

bryevdv commented Aug 5, 2022

@magnatelee I have made suggestion 1) in 6aa5188

When I try to add suggestion 2) get program aborts on my local system:

+LEGATE_TEST=1 /home/bryan/work/legate.core/install38/bin/legate /home/bryan/work/cunumeric/examples/cholesky.py -cunumeric:test --gpus 2 --gpu-bind 0,1 --fbmem 4294967296
[FAIL] (GPU) examples/cholesky.py
   /home/bryan/work/legate.core/install38/bin/bind.sh: line 107: 82933 Aborted                 numactl "$@"

it looks like numactl does not know what to do with --fbmem (or further up, that bind.sh does not know how to pass it on to legate instead of numactl, possibly?)

@bryevdv
Copy link
Contributor Author

bryevdv commented Aug 5, 2022

Assuming the latest run is as good as the previous one:

image

I'd suggest merging this as-is, so that the team can benefit from faster CI jobs, and then making a follow-on issue about explicitly providing --fbmem

@magnatelee
Copy link
Contributor

I think this part is problematic: --fbmem 4294967296. --fbmem takes a size in megabytes, so your command is asking for a 4PB space on framebuffer.

@bryevdv
Copy link
Contributor Author

bryevdv commented Aug 5, 2022

asking for a 4PB space on framebuffer.

3af2be8 seems to fix locally :D

@bryevdv bryevdv merged commit 0c7da23 into nv-legate:branch-22.10 Aug 5, 2022
@bryevdv bryevdv deleted the bv/test_collate branch August 5, 2022 23:18
@bryevdv bryevdv changed the title [WIP] Collate test output to allow workers > 1 with verbose output Collate test output to allow workers > 1 with verbose output Aug 8, 2022
sbak5 pushed a commit to sbak5/cunumeric that referenced this pull request Aug 17, 2022
…v-legate#507)

* report times in summary lines

* fix typo

* Add an overall test suite summary

* defer test output until test completion

* remove -j 1 argument to test.sh

* try bloat factor = 1.25

* fix default fbsize and bloat factor

* specify fbmem in MB
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants