GH-43040: [C++] Reduce the recursion of many-join test #43042
pitrou merged 3 commits into apache:main
@github-actions crossbow submit -g cpp

Revision: 7251176
Submitted crossbow builds: ursacomputing/crossbow @ actions-5714627e25

@github-actions crossbow submit -g cpp

Revision: 3e6acd8
Submitted crossbow builds: ursacomputing/crossbow @ actions-d143fbd7c0
Ran with join recursion = 16.
Ran with join recursion = 72.

@github-actions crossbow submit -g cpp

Revision: 6ccaa6c
Submitted crossbow builds: ursacomputing/crossbow @ actions-368846a25e
Ran with join recursion = 16 again.

Hi @pitrou @felipecrv, could you help take a look? This will fix two long-failing jobs.

  // A fair number of joins to guarantee temp vector stack overflow before GH-41335.
- const int num_joins = 64;
+ const int num_joins = 16;
To make sure this conservative value still serves the same protective purpose, I verified locally that, with commit 6c386da reverted, the test fails (with "temp stack overflow") at 16 joins (the minimum number of joins needed to fail is actually 14).
Can you condition the reduction on the specific platforms that can't handle num_joins=64, so that possible bugs with a high number of joins are still caught in regression tests?
Yeah, that's a nice idea. It's just that the condition could be very tricky to identify. So far I've seen the following results with the number of joins set to 64:
- Ubuntu w/ or w/o ASAN (the CI jobs): all good.
- macOS w/ ASAN: stack overflow; macOS w/o ASAN: good.
- Alpine and Emscripten w/o ASAN (the CI jobs): segfault or out-of-bounds memory access (presumably also caused by stack overflow).
Also, I can't find any macros that differentiate Linux distributions such as Alpine and Ubuntu. To let at least one build run 64 joins, the only safe condition seems to be enabling 64 joins on Linux w/ ASAN, but that only works because our only sanitizer build runs on Ubuntu.
Any suggestions?
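For illustration only, the condition discussed above could look like the compile-time switch below. This is a sketch, not what the PR does (it settled on 16 everywhere). `ChooseNumJoins`, `kIsLinux`, `kAsanEnabled`, and `SKETCH_ASAN` are hypothetical names; `__linux__`, `__SANITIZE_ADDRESS__`, and Clang's `__has_feature(address_sanitizer)` are the usual compiler-provided checks. Note that `__linux__` is also defined on Alpine, which is exactly the limitation described above.

```cpp
// Sketch only: pick the join count per build configuration.
// All names introduced here are hypothetical and are not Arrow code.

// Detect AddressSanitizer at compile time.
#if defined(__SANITIZE_ADDRESS__)
#  define SKETCH_ASAN 1  // GCC defines this under -fsanitize=address
#elif defined(__has_feature)
#  if __has_feature(address_sanitizer)
#    define SKETCH_ASAN 1  // Clang reports ASAN through __has_feature
#  endif
#endif
#ifndef SKETCH_ASAN
#  define SKETCH_ASAN 0
#endif

constexpr bool kAsanEnabled = SKETCH_ASAN != 0;

// __linux__ is set on every Linux distribution, so it cannot tell
// Alpine apart from Ubuntu.
#if defined(__linux__)
constexpr bool kIsLinux = true;
#else
constexpr bool kIsLinux = false;
#endif

// Assumed policy: run the aggressive 64-join case only on Linux with ASAN,
// and fall back to the conservative 16 everywhere else.
constexpr int ChooseNumJoins(bool is_linux, bool asan) {
  return (is_linux && asan) ? 64 : 16;
}

constexpr int kNumJoins = ChooseNumJoins(kIsLinux, kAsanEnabled);
```

Such a switch silently depends on which CI images happen to run the sanitizer build, which is one reason a single conservative value is easier to maintain.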
I think we can stick with 16 if it's enough to reproduce the issue.
> I think we can stick with 16 if it's enough to reproduce the issue.
What do you mean by "enough to repro the issue"? Reducing to 16 is making the issue "go away".
I think the "issue" here means this:
> by reverting commit 6c386da, the test failed (with "temp stack overflow") with 16 joins
In other words, 16 joins serve the purpose this test was originally designed to cover.
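The reasoning can be pictured with a toy model: if a fixed-capacity temp stack overflows at some minimum nesting depth, any test depth at or above that minimum still reproduces the overflow. The sketch below is not Arrow's temp vector stack. `ToyTempStack`, `NestedDepthFits`, and `MinOverflowDepth` are hypothetical, and the capacity and per-level sizes are made-up numbers chosen only to make the arithmetic easy to follow.

```cpp
#include <cstddef>

// Toy fixed-capacity stack allocator (hypothetical, NOT Arrow's implementation).
class ToyTempStack {
 public:
  explicit ToyTempStack(std::size_t capacity) : capacity_(capacity) {}

  // Reserves `bytes`; returns false instead of reserving on overflow.
  bool Push(std::size_t bytes) {
    if (bytes > capacity_ - used_) return false;
    used_ += bytes;
    return true;
  }

 private:
  std::size_t capacity_;
  std::size_t used_ = 0;
};

// Simulates `depth` nested operators, each holding `bytes_per_level` at once.
bool NestedDepthFits(std::size_t capacity, std::size_t bytes_per_level, int depth) {
  ToyTempStack stack(capacity);
  for (int i = 0; i < depth; ++i) {
    if (!stack.Push(bytes_per_level)) return false;
  }
  return true;
}

// Returns the smallest depth in [1, max_depth] that overflows, or -1 if none does.
int MinOverflowDepth(std::size_t capacity, std::size_t bytes_per_level, int max_depth) {
  for (int d = 1; d <= max_depth; ++d) {
    if (!NestedDepthFits(capacity, bytes_per_level, d)) return d;
  }
  return -1;
}
```

With the made-up values of 1024 bytes of capacity and 100 bytes per level, `MinOverflowDepth(1024, 100, 64)` is 11, so a regression test at depth 16 would trip the overflow just as one at depth 64 would.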
I will let @pitrou approve and merge this one.
pitrou left a comment
LGTM, thanks for fixing this @zanmato1984
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 2a8fa3e. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Rationale for this change
The current recursion depth of 64 in the many-join test is too aggressive: the C program stack may overflow on Alpine or Emscripten, causing failures like #43040.
What changes are included in this PR?
Reduce the recursion depth to 16, which is still enough for the purpose of #41335, which introduced this test.
Are these changes tested?
The change itself is a test change.
Are there any user-facing changes?
None.