I first noticed the issue in the OpenSerach k8s charm where the charm just hangs indefinitely on a hook.
Disclaimer: I generated the issue description with AI
Summary
When an ExecProcess's underlying change wait fails — most commonly when an exec
timeout elapses, but also on any error from wait_change() — ExecProcess._wait()
propagates the exception before it joins its I/O pump threads and shuts down the
exec websockets. Because those pump threads are created with the default
daemon=False, they remain blocked in websocket.recv() on websockets that are
never closed.
This has two consequences:
- Interpreter shutdown hangs.
threading._shutdown() blocks forever joining the
still-running non-daemon threads (observed as a hang in lock.acquire() at process
exit).
- Noisy unraisable tracebacks. If/when the threads eventually unblock (peer closes
the connection), they raise
websocket._exceptions.WebSocketConnectionClosedException: socket is already closed.
to threading.excepthook, printing full tracebacks from
shutil.copyfileobj → ops.pebble._WebsocketReader.read → ws.recv().
Root cause
In ExecProcess._wait() (ops/pebble.py ~L1784):
def _wait(self) -> int:
self._waited = True
timeout = self._timeout
if timeout is not None:
timeout += 1
change = self._client.wait_change(self._change_id, timeout=timeout) # L1790 <-- can raise
# --- cleanup below only runs if wait_change() returned normally ---
if self._cancel_stdin is not None: # L1793
self._cancel_stdin()
for thread in self._threads: # L1797
thread.join()
if self._cancel_reader is not None: # L1802
os.close(self._cancel_reader)
self._control_ws.shutdown() # L1806
self._stdio_ws.shutdown()
if self._stderr_ws is not None:
self._stderr_ws.shutdown()
...
self._client.wait_change(...) at L1790 raises on the client-side websocket timeout
(timeout + 1) when the server doesn't complete the change in time. The exception
propagates straight out of _wait(), so the cleanup at L1793–L1809 (cancel stdin,
join threads, shutdown websockets) is skipped.
The leaked threads are:
- the
shutil.copyfileobj pumps started in wait_output() (L1845–L1850), and
- the forwarder threads started in
Client.exec()
(_reader_to_websocket / _websocket_to_writer, ~L3172–L3197).
They are started via _start_thread() (L412–L416):
def _start_thread(target, *args, **kwargs):
thread = threading.Thread(target=target, args=args, kwargs=kwargs) # daemon=False
thread.start()
return thread
Both wait() (L1769) and wait_output() (L1819) are affected, since both delegate to
_wait().
Reproduction
container.exec(["sleep", "60"], timeout=5, combine_stderr=True, encoding="utf-8").wait_output()
The exec server-side timeout / client-side wait_change timeout fires, wait_output()
raises, and the stdout copyfileobj thread is left blocked on self._stdio_ws. The
process then either hangs at interpreter exit (threads still blocked) or prints, when
they later unblock:
Exception in thread Thread-14 (copyfileobj):
Traceback (most recent call last):
...
File ".../ops/pebble.py", line 1982, in read
chunk = self.ws.recv()
File ".../websocket/_core.py", line 395, in recv
...
File ".../websocket/_socket.py", line 95, in recv
raise WebSocketConnectionClosedException("socket is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: socket is already closed.
Impact
- Any consumer that runs
exec(..., timeout=...) and handles the resulting exception
can hang at process exit, or emit alarming-looking tracebacks from background threads
that are actually benign teardown noise.
- This is particularly painful in Juju charm hooks: a single timed-out exec can wedge
the hook process (Juju then has to kill it) or flood juju debug-log with thread
tracebacks.
Suggested fix
Guarantee the teardown runs whether or not wait_change() succeeds — i.e. move the
cancel-stdin / join-threads / shutdown-websockets steps into a finally block in
_wait():
def _wait(self) -> int:
self._waited = True
timeout = self._timeout
if timeout is not None:
timeout += 1
try:
change = self._client.wait_change(self._change_id, timeout=timeout)
finally:
if self._cancel_stdin is not None:
self._cancel_stdin()
for thread in self._threads:
thread.join()
if self._cancel_reader is not None:
os.close(self._cancel_reader)
self._control_ws.shutdown()
self._stdio_ws.shutdown()
if self._stderr_ws is not None:
self._stderr_ws.shutdown()
...
Note: thread.join() in the cleanup path is itself only safe to call after
the websockets are shut down (otherwise the join can block on a still-blocked
recv()); the team may want to shut the websockets down first, then join, in the
failure path. A simpler/more robust belt-and-braces option is to also create the I/O
pump threads as daemon=True so a missed teardown can never hang interpreter
shutdown.
Environment
ops 3.7.1 (also present in 3.7.0; the relevant code is unchanged across recent 3.x)
websocket-client 1.9.0
- Python 3.12 / 3.13
I first noticed the issue in the OpenSerach k8s charm where the charm just hangs indefinitely on a hook.
Disclaimer: I generated the issue description with AI
Summary
When an
ExecProcess's underlying change wait fails — most commonly when an exectimeout elapses, but also on any error from
wait_change()—ExecProcess._wait()propagates the exception before it joins its I/O pump threads and shuts down the
exec websockets. Because those pump threads are created with the default
daemon=False, they remain blocked inwebsocket.recv()on websockets that arenever closed.
This has two consequences:
threading._shutdown()blocks forever joining thestill-running non-daemon threads (observed as a hang in
lock.acquire()at processexit).
the connection), they raise
websocket._exceptions.WebSocketConnectionClosedException: socket is already closed.to
threading.excepthook, printing full tracebacks fromshutil.copyfileobj→ops.pebble._WebsocketReader.read→ws.recv().Root cause
In
ExecProcess._wait()(ops/pebble.py~L1784):self._client.wait_change(...)at L1790 raises on the client-side websocket timeout(
timeout + 1) when the server doesn't complete the change in time. The exceptionpropagates straight out of
_wait(), so the cleanup at L1793–L1809 (cancel stdin,join threads, shutdown websockets) is skipped.
The leaked threads are:
shutil.copyfileobjpumps started inwait_output()(L1845–L1850), andClient.exec()(
_reader_to_websocket/_websocket_to_writer, ~L3172–L3197).They are started via
_start_thread()(L412–L416):Both
wait()(L1769) andwait_output()(L1819) are affected, since both delegate to_wait().Reproduction
The exec server-side timeout / client-side
wait_changetimeout fires,wait_output()raises, and the stdout
copyfileobjthread is left blocked onself._stdio_ws. Theprocess then either hangs at interpreter exit (threads still blocked) or prints, when
they later unblock:
Impact
exec(..., timeout=...)and handles the resulting exceptioncan hang at process exit, or emit alarming-looking tracebacks from background threads
that are actually benign teardown noise.
the hook process (Juju then has to kill it) or flood
juju debug-logwith threadtracebacks.
Suggested fix
Guarantee the teardown runs whether or not
wait_change()succeeds — i.e. move thecancel-stdin / join-threads / shutdown-websockets steps into a
finallyblock in_wait():Environment
ops3.7.1 (also present in 3.7.0; the relevant code is unchanged across recent 3.x)websocket-client1.9.0