`ExecProcess.wait()`/`wait_output()` leak non-daemon I/O threads when `wait_change` raises (e.g. exec timeout)

I first noticed the issue in the OpenSerach k8s charm where the charm just hangs indefinitely on a hook.

Disclaimer: I generated the issue description with AI

## Summary

When an `ExecProcess`'s underlying change wait fails — most commonly when an exec
timeout elapses, but also on any error from `wait_change()` — `ExecProcess._wait()`
propagates the exception **before** it joins its I/O pump threads and shuts down the
exec websockets. Because those pump threads are created with the default
`daemon=False`, they remain blocked in `websocket.recv()` on websockets that are
never closed.

This has two consequences:

1. **Interpreter shutdown hangs.** `threading._shutdown()` blocks forever joining the
   still-running non-daemon threads (observed as a hang in `lock.acquire()` at process
   exit).
2. **Noisy unraisable tracebacks.** If/when the threads eventually unblock (peer closes
   the connection), they raise
   `websocket._exceptions.WebSocketConnectionClosedException: socket is already closed.`
   to `threading.excepthook`, printing full tracebacks from
   `shutil.copyfileobj` → `ops.pebble._WebsocketReader.read` → `ws.recv()`.

## Root cause

In `ExecProcess._wait()` (`ops/pebble.py` ~L1784):

```python
def _wait(self) -> int:
    self._waited = True
    timeout = self._timeout
    if timeout is not None:
        timeout += 1
    change = self._client.wait_change(self._change_id, timeout=timeout)   # L1790  <-- can raise

    # --- cleanup below only runs if wait_change() returned normally ---
    if self._cancel_stdin is not None:        # L1793
        self._cancel_stdin()
    for thread in self._threads:              # L1797
        thread.join()
    if self._cancel_reader is not None:       # L1802
        os.close(self._cancel_reader)
    self._control_ws.shutdown()               # L1806
    self._stdio_ws.shutdown()
    if self._stderr_ws is not None:
        self._stderr_ws.shutdown()
    ...
```

`self._client.wait_change(...)` at L1790 raises on the client-side websocket timeout
(`timeout + 1`) when the server doesn't complete the change in time. The exception
propagates straight out of `_wait()`, so the cleanup at L1793–L1809 (cancel stdin,
join threads, shutdown websockets) is skipped.

The leaked threads are:

- the `shutil.copyfileobj` pumps started in `wait_output()` (L1845–L1850), and
- the forwarder threads started in `Client.exec()`
  (`_reader_to_websocket` / `_websocket_to_writer`, ~L3172–L3197).

They are started via `_start_thread()` (L412–L416):

```python
def _start_thread(target, *args, **kwargs):
    thread = threading.Thread(target=target, args=args, kwargs=kwargs)  # daemon=False
    thread.start()
    return thread
```

Both `wait()` (L1769) and `wait_output()` (L1819) are affected, since both delegate to
`_wait()`.

## Reproduction

```python
container.exec(["sleep", "60"], timeout=5, combine_stderr=True, encoding="utf-8").wait_output()
```

The exec server-side timeout / client-side `wait_change` timeout fires, `wait_output()`
raises, and the stdout `copyfileobj` thread is left blocked on `self._stdio_ws`. The
process then either hangs at interpreter exit (threads still blocked) or prints, when
they later unblock:

```
Exception in thread Thread-14 (copyfileobj):
Traceback (most recent call last):
...
File ".../ops/pebble.py", line 1982, in read
    chunk = self.ws.recv()
File ".../websocket/_core.py", line 395, in recv
...
File ".../websocket/_socket.py", line 95, in recv
    raise WebSocketConnectionClosedException("socket is already closed.")
websocket._exceptions.WebSocketConnectionClosedException: socket is already closed.
```

## Impact

- Any consumer that runs `exec(..., timeout=...)` and handles the resulting exception
  can hang at process exit, or emit alarming-looking tracebacks from background threads
  that are actually benign teardown noise.
- This is particularly painful in Juju charm hooks: a single timed-out exec can wedge
  the hook process (Juju then has to kill it) or flood `juju debug-log` with thread
  tracebacks.

## Suggested fix

Guarantee the teardown runs whether or not `wait_change()` succeeds — i.e. move the
cancel-stdin / join-threads / shutdown-websockets steps into a `finally` block in
`_wait()`:

```python
def _wait(self) -> int:
    self._waited = True
    timeout = self._timeout
    if timeout is not None:
        timeout += 1
    try:
        change = self._client.wait_change(self._change_id, timeout=timeout)
    finally:
        if self._cancel_stdin is not None:
            self._cancel_stdin()
        for thread in self._threads:
            thread.join()
        if self._cancel_reader is not None:
            os.close(self._cancel_reader)
        self._control_ws.shutdown()
        self._stdio_ws.shutdown()
        if self._stderr_ws is not None:
            self._stderr_ws.shutdown()
    ...
```

> **Note:** `thread.join()` in the cleanup path is itself only safe to call **after**
> the websockets are shut down (otherwise the join can block on a still-blocked
> `recv()`); the team may want to shut the websockets down first, then join, in the
> failure path. A simpler/more robust belt-and-braces option is to also create the I/O
> pump threads as `daemon=True` so a missed teardown can never hang interpreter
> shutdown.

## Environment

- `ops` 3.7.1 (also present in 3.7.0; the relevant code is unchanged across recent 3.x)
- `websocket-client` 1.9.0
- Python 3.12 / 3.13


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`ExecProcess.wait()`/`wait_output()` leak non-daemon I/O threads when `wait_change` raises (e.g. exec timeout) #2556

Summary

Root cause

Reproduction

Impact

Suggested fix

Environment

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ExecProcess.wait()/wait_output() leak non-daemon I/O threads when wait_change raises (e.g. exec timeout) #2556

Description

Summary

Root cause

Reproduction

Impact

Suggested fix

Environment

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`ExecProcess.wait()`/`wait_output()` leak non-daemon I/O threads when `wait_change` raises (e.g. exec timeout) #2556