Checks
Controller Version
0.12.0
Deployment Method
Helm
Checks
To Reproduce
1. Start a long-running GHA job
2. Run `kubectl drain <node-name>` on the EKS node running the pod for the allocated EphemeralRunner. (Directly deleting the runner pod with `kubectl delete pod <pod-name>` also has the same effect, but isn't what we normally do/experience.)
3. Observe that the Runner disappears from the GHE list of active runners
4. Observe that the EphemeralRunner in K8s stays in `Running` state forever (see the sketch after these steps)
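For concreteness, here is roughly the command sequence involved. The node name is a placeholder, the drain flags are just what is typically needed to evict pods on an EKS node (they are not part of the bug), and the namespace matches the cleanup script further down:

# Drain the node hosting the runner pod (flags shown are typical for EKS; adjust as needed)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Afterwards the EphemeralRunner for the evicted pod remains in phase Running but not ready
kubectl get ephemeralrunners -n gha-runner-scale-set \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready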
Describe the bug
Although the Runner (as recorded in the GitHub Actions list of org-attached Runners on the Settings page) goes away, the EphemeralRunner stays allocated forever.
This leads the AutoscalingRunnerSet to think it doesn't need to scale up any further, and we observe long wait times before new Runners are allocated to Jobs. Until this is fixed we have to delete the stuck EphemeralRunners manually with a script like the one below:
#!/usr/bin/env bash
set -euo pipefail

# Find EphemeralRunners that look stuck: phase Running, not ready, but still assigned a job.
STUCK_RUNNERS=$(kubectl get ephemeralrunners -n gha-runner-scale-set -o json \
  | jq -r '.items[] | select(.status.phase == "Running" and .status.ready == false and .status.jobRepositoryName != null) | .metadata.name' \
  | tr '\n' ' ')

if [ -z "$STUCK_RUNNERS" ]; then
  echo "No stuck EphemeralRunners."
  exit 0
fi

echo "Deleting: $STUCK_RUNNERS"
# $STUCK_RUNNERS is deliberately unquoted so each runner name becomes its own argument.
kubectl delete ephemeralrunners -n gha-runner-scale-set $STUCK_RUNNERS
Describe the expected behavior
At a minimum, ARC should notice that the Runner has disappeared and delete the stuck EphemeralRunner automatically.
The best solution would be for ARC to resubmit the job for a re-run when it sees this condition, or at least emit a specific K8s event so that we could easily build such automation on top via a custom watcher.
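For illustration, here is a minimal sketch of the kind of watcher we would bolt on top if such an event existed. The event reason `RunnerVanished` is purely hypothetical (ARC does not emit any such event today); the point is only that a dedicated, filterable event would make this trivial to automate:

# Hypothetical: watch for a dedicated event and delete the affected EphemeralRunner.
# "RunnerVanished" is an invented reason; ARC emits no such event at present.
kubectl get events -n gha-runner-scale-set --watch -o json \
  | jq -r --unbuffered 'select(.reason == "RunnerVanished") | .involvedObject.name' \
  | while read -r er; do
      kubectl delete ephemeralrunner -n gha-runner-scale-set "$er"
    done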
Additional Context
Here's the redacted YAML for the ER itself, complete with status: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-x86-64-h8696-runner-jzgnk-yaml
Controller Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-controller-logs-txt - the node was drained around 19:11 UTC.
See also the listener logs, if they are of any interest: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-listener-logs-txt
Runner Pod Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-runner-logs-tsv
(Note: these were copied from our logging server, since the runner pod itself is deleted during bug reproduction.)