Skip to content

ExecProcess.wait()/wait_output() leak non-daemon I/O threads when wait_change raises (e.g. exec timeout) #2556

@skourta

Description

@skourta

I first noticed the issue in the OpenSerach k8s charm where the charm just hangs indefinitely on a hook.

Disclaimer: I generated the issue description with AI

Summary

When an ExecProcess's underlying change wait fails — most commonly when an exec
timeout elapses, but also on any error from wait_change()ExecProcess._wait()
propagates the exception before it joins its I/O pump threads and shuts down the
exec websockets. Because those pump threads are created with the default
daemon=False, they remain blocked in websocket.recv() on websockets that are
never closed.

This has two consequences:

  1. Interpreter shutdown hangs. threading._shutdown() blocks forever joining the
    still-running non-daemon threads (observed as a hang in lock.acquire() at process
    exit).
  2. Noisy unraisable tracebacks. If/when the threads eventually unblock (peer closes
    the connection), they raise
    websocket._exceptions.WebSocketConnectionClosedException: socket is already closed.
    to threading.excepthook, printing full tracebacks from
    shutil.copyfileobjops.pebble._WebsocketReader.readws.recv().

Root cause

In ExecProcess._wait() (ops/pebble.py ~L1784):

def _wait(self) -> int:
    self._waited = True
    timeout = self._timeout
    if timeout is not None:
        timeout += 1
    change = self._client.wait_change(self._change_id, timeout=timeout)   # L1790  <-- can raise

    # --- cleanup below only runs if wait_change() returned normally ---
    if self._cancel_stdin is not None:        # L1793
        self._cancel_stdin()
    for thread in self._threads:              # L1797
        thread.join()
    if self._cancel_reader is not None:       # L1802
        os.close(self._cancel_reader)
    self._control_ws.shutdown()               # L1806
    self._stdio_ws.shutdown()
    if self._stderr_ws is not None:
        self._stderr_ws.shutdown()
    ...

self._client.wait_change(...) at L1790 raises on the client-side websocket timeout
(timeout + 1) when the server doesn't complete the change in time. The exception
propagates straight out of _wait(), so the cleanup at L1793–L1809 (cancel stdin,
join threads, shutdown websockets) is skipped.

The leaked threads are:

  • the shutil.copyfileobj pumps started in wait_output() (L1845–L1850), and
  • the forwarder threads started in Client.exec()
    (_reader_to_websocket / _websocket_to_writer, ~L3172–L3197).

They are started via _start_thread() (L412–L416):

def _start_thread(target, *args, **kwargs):
    thread = threading.Thread(target=target, args=args, kwargs=kwargs)  # daemon=False
    thread.start()
    return thread

Both wait() (L1769) and wait_output() (L1819) are affected, since both delegate to
_wait().

Reproduction

container.exec(["sleep", "60"], timeout=5, combine_stderr=True, encoding="utf-8").wait_output()

The exec server-side timeout / client-side wait_change timeout fires, wait_output()
raises, and the stdout copyfileobj thread is left blocked on self._stdio_ws. The
process then either hangs at interpreter exit (threads still blocked) or prints, when
they later unblock:

Exception in thread Thread-14 (copyfileobj):
Traceback (most recent call last):
...
File ".../ops/pebble.py", line 1982, in read
    chunk = self.ws.recv()
File ".../websocket/_core.py", line 395, in recv
...
File ".../websocket/_socket.py", line 95, in recv
    raise WebSocketConnectionClosedException("socket is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: socket is already closed.

Impact

  • Any consumer that runs exec(..., timeout=...) and handles the resulting exception
    can hang at process exit, or emit alarming-looking tracebacks from background threads
    that are actually benign teardown noise.
  • This is particularly painful in Juju charm hooks: a single timed-out exec can wedge
    the hook process (Juju then has to kill it) or flood juju debug-log with thread
    tracebacks.

Suggested fix

Guarantee the teardown runs whether or not wait_change() succeeds — i.e. move the
cancel-stdin / join-threads / shutdown-websockets steps into a finally block in
_wait():

def _wait(self) -> int:
    self._waited = True
    timeout = self._timeout
    if timeout is not None:
        timeout += 1
    try:
        change = self._client.wait_change(self._change_id, timeout=timeout)
    finally:
        if self._cancel_stdin is not None:
            self._cancel_stdin()
        for thread in self._threads:
            thread.join()
        if self._cancel_reader is not None:
            os.close(self._cancel_reader)
        self._control_ws.shutdown()
        self._stdio_ws.shutdown()
        if self._stderr_ws is not None:
            self._stderr_ws.shutdown()
    ...

Note: thread.join() in the cleanup path is itself only safe to call after
the websockets are shut down (otherwise the join can block on a still-blocked
recv()); the team may want to shut the websockets down first, then join, in the
failure path. A simpler/more robust belt-and-braces option is to also create the I/O
pump threads as daemon=True so a missed teardown can never hang interpreter
shutdown.

Environment

  • ops 3.7.1 (also present in 3.7.0; the relevant code is unchanged across recent 3.x)
  • websocket-client 1.9.0
  • Python 3.12 / 3.13

Metadata

Metadata

Labels

rainy daySmall items done in ~10% of each week's timesmall itemA small item, for some value of 'small'

Type

No fields configured for Bug.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions