Skip to content

fix: don't leak exec I/O threads when waiting on the change fails#2558

Open
tonyandrewmeyer wants to merge 2 commits into
canonical:mainfrom
tonyandrewmeyer:fix-exec-wait-thread-leak
Open

fix: don't leak exec I/O threads when waiting on the change fails#2558
tonyandrewmeyer wants to merge 2 commits into
canonical:mainfrom
tonyandrewmeyer:fix-exec-wait-thread-leak

Conversation

@tonyandrewmeyer

Copy link
Copy Markdown
Collaborator

If the change for an exec doesn't finish before the client-side timeout, or waiting on it fails in any other way, ExecProcess._wait() propagates the exception without any teardown. The I/O pump threads are left blocked in recv() on the still open websockets, and being non-daemon threads they then block interpreter shutdown indefinitely, wedging the Juju hook. See #2556 for the reproduction details (note that the leak needs the change to outlive the client-side wait, for example an orphan holding the exec's stdio pipe open; a plain exec timeout completes the change server-side and already cleans up correctly).

This tears the exec down when the wait fails:

  • Shut the websocket connections down at the OS level (SHUT_RDWR) rather than only closing the file descriptors. Closing a socket doesn't wake other threads blocked in recv() on it, so the finally approach suggested in the issue (or joining the threads first, as the success path does) would hang the hook immediately instead of at exit.
  • Treat a connection closed underneath an I/O pump thread (WebSocketException or OSError) as end-of-stream rather than letting the exception escape the thread, which printed noisy WebSocketConnectionClosedException tracebacks via threading.excepthook.
  • Create the exec I/O threads as daemon threads, so any teardown path that's still missed (including never calling wait() at all) can't block interpreter shutdown.
  • Close any already-connected websockets when connecting the exec websockets fails partway through, rather than leaking them (found while auditing for related leaks).

The behaviour of the success path is unchanged: the threads are still joined before the websockets are shut down, so output is never truncated.

Verified against a real Pebble (v1.31.0) on Python 3.10 through 3.14: before the fix the reproduction from #2556 leaks two copyfileobj threads and hangs at exit; with the fix it raises ops.pebble.TimeoutError as before, leaves no threads behind, and exits cleanly.

Fixes #2556

tonyandrewmeyer and others added 2 commits June 10, 2026 20:14
…ails

If the change for an exec didn't finish before the client-side timeout
(for example, because an orphaned grandchild of the exec'd command kept
the stdio pipe open past the exec timeout), or waiting on it failed in
any other way, ExecProcess._wait() propagated the exception without any
teardown. The I/O pump threads were left blocked in recv() on the still
open websockets, and being non-daemon threads they then blocked
interpreter shutdown indefinitely (wedging the Juju hook).

Tear the exec down when the wait fails:

- Shut the websocket connections down at the OS level (SHUT_RDWR), not
  just close the file descriptors: closing a socket doesn't wake other
  threads blocked in recv() on it, so a plain shutdown() (or joining the
  threads first, as the success path does) would hang.
- Treat a connection closed underneath an I/O pump thread as
  end-of-stream rather than letting the exception escape, which printed
  noisy WebSocketConnectionClosedException tracebacks from
  threading.excepthook.
- Create the I/O threads as daemon threads, so that any teardown path
  that's still missed (including never calling wait()) can't block
  interpreter shutdown.

Also close any already-connected websockets when connecting the exec
websockets fails partway through, rather than leaking them.

Fixes canonical#2556
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…#21)

The error path added in _teardown_after_error() nulls out _cancel_stdin
and _cancel_reader after using them, but the success path in _wait() did
not. Because both wait() and wait_output() call _wait(), calling them in
turn (or wait() twice) would re-run _cancel_stdin() and os.close() the
already-closed cancel_reader fd, raising OSError on the second call.

Clear both on the success path too, matching _teardown_after_error(), and
add a regression test. Also note in _force_close_websocket() that it
depends on websocket-client's internal WebSocket.sock attribute.

Co-authored-by: Claude <noreply@anthropic.com>
@tonyandrewmeyer tonyandrewmeyer marked this pull request as ready for review June 10, 2026 22:52

@hpidcock hpidcock left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems logical, not a python expert so had to read up on threading/daemon threads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ExecProcess.wait()/wait_output() leak non-daemon I/O threads when wait_change raises (e.g. exec timeout)

2 participants