Checks
Controller Version
0.12.0
Deployment Method
Helm
Checks
To Reproduce
1. Start a long-running GHA job
2. Run `kubectl drain <node-name>` on the EKS node running the pod for the allocated EphemeralRunner. (Directly deleting the runner pod with `kubectl delete pod <pod-name>` also has the same effect, but isn't what we normally do/experience.)
3. Observe that the Runner disappears from the GHE list of active runners
4. Observe that the EphemeralRunner in K8s stays in `Running` state forever (see the sketch after these steps)
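For concreteness, here is roughly the command sequence involved. The node name is a placeholder, the drain flags are just what is typically needed to evict pods on an EKS node (they are not part of the bug), and the namespace matches the cleanup script further down:

# Drain the node hosting the runner pod (flags shown are typical for EKS; adjust as needed)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Afterwards the EphemeralRunner for the evicted pod remains in phase Running but not ready
kubectl get ephemeralrunners -n gha-runner-scale-set \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready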
Describe the bug
Although the Runner (as recorded in the GitHub Actions list of org-attached Runners on the Settings page) goes away, the EphemeralRunner stays allocated forever.
This leads the AutoscalingRunnerSet to think it doesn't need to scale up any further, and we observe long wait times before new Runners are allocated to Jobs. Until this is fixed we have to delete the stuck EphemeralRunners manually with a script like the one below:
#!/usr/bin/env bash
set -euo pipefail

# Find EphemeralRunners that look stuck: phase Running, not ready, but still assigned a job.
STUCK_RUNNERS=$(kubectl get ephemeralrunners -n gha-runner-scale-set -o json \
  | jq -r '.items[] | select(.status.phase == "Running" and .status.ready == false and .status.jobRepositoryName != null) | .metadata.name' \
  | tr '\n' ' ')

if [ -z "$STUCK_RUNNERS" ]; then
  echo "No stuck EphemeralRunners."
  exit 0
fi

echo "Deleting: $STUCK_RUNNERS"
# $STUCK_RUNNERS is deliberately unquoted so each runner name becomes its own argument.
kubectl delete ephemeralrunners -n gha-runner-scale-set $STUCK_RUNNERS
Describe the expected behavior
At a minimum, ARC should notice that the Runner has disappeared and delete the stuck EphemeralRunner automatically.
The best solution would be for ARC to resubmit the job for a re-run when it sees this condition, or at least emit a specific K8s event so that we could easily build such automation on top via a custom watcher.
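For illustration, here is a minimal sketch of the kind of watcher we would bolt on top if such an event existed. The event reason `RunnerVanished` is purely hypothetical (ARC does not emit any such event today); the point is only that a dedicated, filterable event would make this trivial to automate:

# Hypothetical: watch for a dedicated event and delete the affected EphemeralRunner.
# "RunnerVanished" is an invented reason; ARC emits no such event at present.
kubectl get events -n gha-runner-scale-set --watch -o json \
  | jq -r --unbuffered 'select(.reason == "RunnerVanished") | .involvedObject.name' \
  | while read -r er; do
      kubectl delete ephemeralrunner -n gha-runner-scale-set "$er"
    done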
Additional Context
Here's the redacted YAML for the ER itself, complete with status: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-x86-64-h8696-runner-jzgnk-yaml
Controller Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-controller-logs-txt - the node was drained around 19:11 UTC.
See also the listener logs, if they are of any interest: https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-listener-logs-txt
Runner Pod Logs
See https://gist.github.com/tyrken/7810a7c51739511585abcced460176ab#file-runner-logs-tsv
(Note: these were copied from our logging server, since the runner pod itself is deleted during bug reproduction.)