Commit 13f13c2

fix(server): add startup probe for gateway boot (#417)
1 parent cf66d05 commit 13f13c2

4 files changed

Lines changed: 55 additions & 5 deletions

architecture/gateway-single-node.md

Lines changed: 5 additions & 3 deletions
@@ -188,9 +188,11 @@ After the container starts:
 1. **Clean stale nodes**: `clean_stale_nodes()` finds `NotReady` nodes via `kubectl get nodes` and deletes them. This is needed when a container is recreated but reuses the persistent volume -- k3s registers a new node (using the container ID as hostname) while old node entries persist in etcd. Non-fatal on error; returns the count of removed nodes.
 2. **Push local images** (optional, local deploy only): If `OPENSHELL_PUSH_IMAGES` is set, the comma-separated image refs are exported from the local Docker daemon as a single tar, uploaded into the container via `docker put_archive`, and imported into containerd via `ctr images import` in the `k8s.io` namespace. After import, `kubectl rollout restart deployment/openshell openshell` is run, followed by `kubectl rollout status --timeout=180s` to wait for completion. See `crates/openshell-bootstrap/src/push.rs`.
 3. **Wait for gateway health**: `wait_for_gateway_ready()` polls the Docker HEALTHCHECK status up to 180 times, 2 seconds apart (6 min total). A background task streams container logs during this wait. Failure modes:
-- Container exits during polling: error includes recent log lines.
-- Container has no HEALTHCHECK instruction: fails immediately.
-- HEALTHCHECK reports unhealthy on final attempt: error includes recent logs.
+- Container exits during polling: error includes recent log lines.
+- Container has no HEALTHCHECK instruction: fails immediately.
+- HEALTHCHECK reports unhealthy on final attempt: error includes recent logs.
+
+The gateway StatefulSet also uses a Kubernetes `startupProbe` on the gRPC port before steady-state liveness and readiness checks begin. This gives single-node k3s boots extra time to absorb early networking and flannel initialization delay without restarting the gateway pod too aggressively.

 ### 5) mTLS bundle capture
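
The polling behavior described in step 3 can be sketched in isolation. This is a minimal, synchronous illustration and not the code behind `wait_for_gateway_ready()`: the `Health` and `WaitError` enums, the `wait_until_healthy` name, and the closure-based probe are hypothetical stand-ins for the Docker HEALTHCHECK query, and the sketch leaves out log streaming entirely.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical view of what one health poll can observe.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Starting,
    Healthy,
    Unhealthy,
    /// The image defines no HEALTHCHECK instruction.
    NoHealthcheck,
    /// The container exited while we were polling.
    Exited,
}

/// Hypothetical error type mirroring the failure modes listed in step 3.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    ContainerExited,
    NoHealthcheck,
    Unhealthy,
    TimedOut,
}

/// Polls `probe` up to `max_attempts` times, sleeping `interval` between
/// attempts. Returns the attempt number on which the probe became healthy.
pub fn wait_until_healthy<F>(
    mut probe: F,
    max_attempts: u32,
    interval: Duration,
) -> Result<u32, WaitError>
where
    F: FnMut() -> Health,
{
    for attempt in 1..=max_attempts {
        match probe() {
            Health::Healthy => return Ok(attempt),
            // Both of these fail immediately; retrying cannot help.
            Health::Exited => return Err(WaitError::ContainerExited),
            Health::NoHealthcheck => return Err(WaitError::NoHealthcheck),
            // An unhealthy report is only fatal on the final attempt.
            Health::Unhealthy if attempt == max_attempts => {
                return Err(WaitError::Unhealthy);
            }
            Health::Starting | Health::Unhealthy => {
                if attempt < max_attempts {
                    sleep(interval);
                }
            }
        }
    }
    Err(WaitError::TimedOut)
}
```

Calling `wait_until_healthy(probe, 180, Duration::from_secs(2))` would correspond to the roughly 6-minute budget quoted in the document.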

crates/openshell-server/src/lib.rs

Lines changed: 40 additions & 2 deletions
@@ -25,9 +25,10 @@ mod ws_tunnel;

 use openshell_core::{Config, Error, Result};
 use std::collections::HashMap;
+use std::io::ErrorKind;
 use std::sync::{Arc, Mutex};
 use tokio::net::TcpListener;
-use tracing::{error, info};
+use tracing::{debug, error, info};

 pub use grpc::OpenShellService;
 pub use http::{health_router, http_router};
@@ -67,6 +68,13 @@ pub struct ServerState {
     pub ssh_connections_by_sandbox: Mutex<HashMap<String, u32>>,
 }

+fn is_benign_tls_handshake_failure(error: &std::io::Error) -> bool {
+    matches!(
+        error.kind(),
+        ErrorKind::UnexpectedEof | ErrorKind::ConnectionReset
+    )
+}
+
 impl ServerState {
     /// Create new server state.
     #[must_use]
@@ -198,7 +206,11 @@ pub async fn run_server(config: Config, tracing_log_bus: TracingLogBus) -> Resul
                             }
                         }
                         Err(e) => {
-                            error!(error = %e, client = %addr, "TLS handshake failed");
+                            if is_benign_tls_handshake_failure(&e) {
+                                debug!(error = %e, client = %addr, "TLS handshake closed early");
+                            } else {
+                                error!(error = %e, client = %addr, "TLS handshake failed");
+                            }
                         }
                     }
                 });
@@ -211,3 +223,29 @@ pub async fn run_server(config: Config, tracing_log_bus: TracingLogBus) -> Resul
         }
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use super::is_benign_tls_handshake_failure;
+    use std::io::{Error, ErrorKind};
+
+    #[test]
+    fn classifies_probe_style_tls_disconnects_as_benign() {
+        for kind in [ErrorKind::UnexpectedEof, ErrorKind::ConnectionReset] {
+            let error = Error::new(kind, "probe disconnected");
+            assert!(is_benign_tls_handshake_failure(&error));
+        }
+    }
+
+    #[test]
+    fn preserves_real_tls_failures_as_errors() {
+        for kind in [
+            ErrorKind::InvalidData,
+            ErrorKind::PermissionDenied,
+            ErrorKind::Other,
+        ] {
+            let error = Error::new(kind, "real tls failure");
+            assert!(!is_benign_tls_handshake_failure(&error));
+        }
+    }
+}
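
The change in `run_server` maps the classifier onto a log level. As a standalone sketch of that decision, the block below repeats `is_benign_tls_handshake_failure` from the diff and adds a hypothetical `HandshakeLogLevel` enum and `handshake_log_level` helper (neither exists in the commit) so the branch can be exercised without `tracing` or a live TLS listener. The likely motivation, judging by the test names, is that a `tcpSocket` probe connects and disconnects without completing a TLS handshake, which would surface as an early EOF or reset.

```rust
use std::io::{Error, ErrorKind};

/// Hypothetical stand-in for the `debug!` / `error!` choice in `run_server`.
#[derive(Debug, PartialEq, Eq)]
pub enum HandshakeLogLevel {
    Debug,
    Error,
}

/// Same predicate as in the diff: early EOF and connection reset are
/// treated as benign, everything else as a real failure.
fn is_benign_tls_handshake_failure(error: &Error) -> bool {
    matches!(
        error.kind(),
        ErrorKind::UnexpectedEof | ErrorKind::ConnectionReset
    )
}

/// Picks the level at which a failed handshake would be logged.
pub fn handshake_log_level(error: &Error) -> HandshakeLogLevel {
    if is_benign_tls_handshake_failure(error) {
        HandshakeLogLevel::Debug
    } else {
        HandshakeLogLevel::Error
    }
}
```

Returning a value instead of logging directly keeps the decision testable; the commit itself tests only the boolean predicate.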

deploy/helm/openshell/templates/statefulset.yaml

Lines changed: 6 additions & 0 deletions
@@ -110,6 +110,12 @@ spec:
             - name: grpc
               containerPort: {{ .Values.service.port }}
               protocol: TCP
+          startupProbe:
+            tcpSocket:
+              port: grpc
+            periodSeconds: {{ .Values.probes.startup.periodSeconds }}
+            timeoutSeconds: {{ .Values.probes.startup.timeoutSeconds }}
+            failureThreshold: {{ .Values.probes.startup.failureThreshold }}
           livenessProbe:
             tcpSocket:
               port: grpc

deploy/helm/openshell/values.yaml

Lines changed: 4 additions & 0 deletions
@@ -43,6 +43,10 @@ podLifecycle:
   terminationGracePeriodSeconds: 5

 probes:
+  startup:
+    periodSeconds: 2
+    timeoutSeconds: 1
+    failureThreshold: 30
   liveness:
     initialDelaySeconds: 2
     periodSeconds: 5
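
To make the new defaults concrete, the startup window they allow can be worked out from Kubernetes' documented rule that a container has roughly `failureThreshold × periodSeconds` to pass its startup probe before it is restarted. The `StartupProbe` struct and `startup_budget` method below are hypothetical names used only for this arithmetic; they mirror the three keys added to `values.yaml`.

```rust
use std::time::Duration;

/// Hypothetical mirror of the `probes.startup` block in values.yaml.
pub struct StartupProbe {
    pub period_seconds: u32,
    pub timeout_seconds: u32,
    pub failure_threshold: u32,
}

impl StartupProbe {
    /// Approximate time the container is given to start: one probe per
    /// period, tolerated `failure_threshold` times before a restart.
    pub fn startup_budget(&self) -> Duration {
        Duration::from_secs(
            u64::from(self.period_seconds) * u64::from(self.failure_threshold),
        )
    }
}
```

With the defaults in this commit (`periodSeconds: 2`, `failureThreshold: 30`) the budget is about 60 seconds. Raising `failureThreshold` in a values override is the usual way to extend it without slowing down detection of a successful start.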
