fix: bound timeouts when submit_timeout is set#665
Merged
Conversation
When a SAS job causes an infinite loop, the compute service can become CPU-starved and unable to respond to HTTP requests promptly. Two places in the submit() polling loop used timeout=None, allowing indefinite hangs: 1. Status poll (conn.getresponse() for the ?wait=N long-poll): if the server is too slow to respond, the deadline check never fires. Fixed by setting conn.timeout = delay + 10 before the polling loop when submit_timeout is active. A TimeoutError from a slow server is caught and the loop continues, where the deadline check at the top will fire. 2. Cancel-job PUT (conn.connect() after the deadline fires): if the server is still CPU-starved, the cancel request blocks indefinitely and SASsubmitTimeout is never raised. Fixed by temporarily setting conn.timeout = 5 for the cancel attempt, restoring it in a finally block. Together these ensure SASsubmitTimeout is raised within approximately submit_timeout + delay + 10 + 5 seconds worst-case, regardless of whether the server crashes (segfault) or enters an infinite loop. Behavior is unchanged when submit_timeout is not passed. Signed-off-by: Michael Senter, PhD <dmsenter89@users.noreply.github.com>
Signed-off-by: Michael Senter, PhD <dmsenter89@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What Problem does this PR address?
submit_timeoutfrom #664 works correctly when SAS crashes outright (segfault): thecompute service detects the crash and remains responsive, so the cancel request
completes quickly and
SASsubmitTimeoutis raised as expected.When SAS enters an infinite loop instead of crashing, the compute service stays up
but becomes CPU-starved. Two
timeout=Nonesockets cause indefinite hangs:conn.getresponse()for the?wait=Nlong-poll blocks until theserver can respond, which may be far longer than
Nseconds. The deadline checknever fires.
conn.connect()for thebest-effort
PUTcancel blocks indefinitely, soSASsubmitTimeoutis never raised.How this PR Addresses the Problem
Two targeted additions to the
submit()polling loop insasiohttp.py:submit_timeoutis set,conn.timeoutis set todelay + 10seconds. ATimeoutErrorfrom a slow server is caught, theconnection is recycled, and the loop continues — where the deadline check at the
top will fire.
conn.timeoutis temporarily set to5seconds beforeconn.connect(), ensuring the cancel attempt always gives up promptly.Behavior is unchanged when
submit_timeoutis not passed.Relation to #664
submit_timeout,SASsubmitTimeout, and the deadline logic from #664 are untouched.This PR makes them work for the full range of failure modes, not just server crashes.