By default, communication with the OpenShell gateway is secured by mutual TLS (mTLS). The CLI, SDK, and sandbox pods present certificates signed by the cluster CA before they reach any application handler. The PKI is bootstrapped automatically during cluster deployment, and certificates are distributed to Kubernetes secrets and the local filesystem without manual configuration.
The gateway also supports Cloudflare-fronted deployments where the edge, not the gateway, is the first authentication boundary. In that mode the gateway either keeps TLS enabled but allows no-certificate client handshakes (allow_unauthenticated=true) and relies on application-layer Cloudflare JWTs, or disables gateway TLS entirely and serves plaintext behind a trusted reverse proxy or tunnel.
This document covers the certificate hierarchy, the bootstrap process, how gateway transport security modes are enforced, how sandboxes and the CLI consume their certificates, and the broader security model of the gateway.
graph TD
subgraph PKI["PKI (generated at bootstrap)"]
CA["openshell-ca<br/>(self-signed root)"]
SERVER_CERT["openshell-server cert<br/>(signed by CA)"]
CLIENT_CERT["openshell-client cert<br/>(signed by CA, shared)"]
CA --> SERVER_CERT
CA --> CLIENT_CERT
end
subgraph CLUSTER["Kubernetes Cluster"]
S1["openshell-server-tls<br/>Secret (server cert+key)"]
S2["openshell-server-client-ca<br/>Secret (CA cert)"]
S3["openshell-client-tls<br/>Secret (client cert+key+CA)"]
GW["Gateway Process<br/>(tokio-rustls)"]
SBX["Sandbox Pod"]
end
subgraph HOST["User's Machine"]
CLI["CLI"]
MTLS_DIR["~/.config/openshell/<br/>gateways/<gateway-name>/mtls/"]
end
SERVER_CERT --> S1
CA --> S2
CLIENT_CERT --> S3
CLIENT_CERT --> MTLS_DIR
S1 --> GW
S2 --> GW
S3 --> SBX
MTLS_DIR --> CLI
CLI -- "mTLS" --> GW
SBX -- "mTLS" --> GW
The PKI is a single-tier CA hierarchy generated by the openshell-bootstrap crate using rcgen. All certificates are created in a single pass at cluster bootstrap time.
openshell-ca (Self-signed Root CA, O=openshell, CN=openshell-ca)
├── openshell-server (Leaf cert, CN=openshell-server)
│ SANs: openshell, openshell.openshell.svc,
│ openshell.openshell.svc.cluster.local,
│ localhost, host.docker.internal, 127.0.0.1
│ + extra SANs for remote deployments
│
└── openshell-client (Leaf cert, CN=openshell-client)
Shared by the CLI and all sandbox pods.
Key design decisions:
- Single client certificate: One client cert is shared by the CLI and every sandbox pod. This simplifies secret management. Individual sandbox identity is not expressed at the TLS layer; post-authentication identification uses the `x-sandbox-id` gRPC header.
- Long-lived certificates: Certificates use `rcgen` defaults (validity ~1975--4096), which effectively never expire. This is appropriate for an internal dev-cluster PKI where certificates are ephemeral to the cluster's lifetime.
- CA key not persisted: The CA private key is used only during generation and is not stored in any Kubernetes secret. Re-signing requires regenerating the entire PKI.
See crates/openshell-bootstrap/src/pki.rs:35 for the generate_pki() implementation and crates/openshell-bootstrap/src/pki.rs:18 for the default SAN list.
The PKI bundle is distributed as three Kubernetes secrets in the openshell namespace:
| Secret Name | Type | Contents | Consumed By |
|---|---|---|---|
| `openshell-server-tls` | `kubernetes.io/tls` | `tls.crt` (server cert), `tls.key` (server key) | Gateway StatefulSet |
| `openshell-server-client-ca` | `Opaque` | `ca.crt` (CA cert) | Gateway StatefulSet (client verification) |
| `openshell-client-tls` | `Opaque` | `tls.crt` (client cert), `tls.key` (client key), `ca.crt` (CA cert) | Sandbox pods, CLI (via local filesystem) |
Secret names are defined as constants in crates/openshell-bootstrap/src/constants.rs:6-10.
The Helm StatefulSet (deploy/helm/openshell/templates/statefulset.yaml) mounts:
| Volume | Mount Path | Source Secret |
|---|---|---|
| `tls-cert` | `/etc/openshell-tls/server/` (read-only) | `openshell-server-tls` |
| `tls-client-ca` | `/etc/openshell-tls/client-ca/` (read-only) | `openshell-server-client-ca` |
Environment variables point the gateway binary to these paths:
OPENSHELL_TLS_CERT=/etc/openshell-tls/server/tls.crt
OPENSHELL_TLS_KEY=/etc/openshell-tls/server/tls.key
OPENSHELL_TLS_CLIENT_CA=/etc/openshell-tls/client-ca/ca.crt
When the gateway creates a sandbox pod (crates/openshell-server/src/sandbox/mod.rs:681), it injects:
- A volume backed by the `openshell-client-tls` secret.
- A read-only mount at `/etc/openshell-tls/client/` on the agent container.
- Environment variables for the sandbox gRPC client:
OPENSHELL_TLS_CA=/etc/openshell-tls/client/ca.crt
OPENSHELL_TLS_CERT=/etc/openshell-tls/client/tls.crt
OPENSHELL_TLS_KEY=/etc/openshell-tls/client/tls.key
OPENSHELL_ENDPOINT=https://openshell.openshell.svc.cluster.local:8080
The CLI's copy of the client certificate bundle is written to:
$XDG_CONFIG_HOME/openshell/gateways/<gateway-name>/mtls/
├── ca.crt
├── tls.crt
└── tls.key
Files are written atomically using a temp-dir -> validate -> rename strategy with backup and rollback on failure. See crates/openshell-bootstrap/src/mtls.rs:10.
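The temp-dir -> validate -> rename strategy above can be sketched as follows. This is a language-agnostic Python stand-in for the Rust implementation, with a hypothetical helper name and a marker-only PEM check; the real logic lives in crates/openshell-bootstrap/src/mtls.rs.

```python
import os
import shutil
import tempfile

PEM_MARKER = "-----BEGIN"

def store_pki_bundle(dest_dir, files):
    """Atomically replace dest_dir with the given PEM files (sketch).

    Write everything to a sibling temp directory, validate it, then swap
    it into place, keeping a backup of the old directory for rollback.
    """
    parent = os.path.dirname(os.path.abspath(dest_dir)) or "."
    os.makedirs(parent, exist_ok=True)
    tmp = tempfile.mkdtemp(dir=parent, prefix=".mtls-tmp-")
    try:
        for name, pem in files.items():
            if PEM_MARKER not in pem:          # validate before committing
                raise ValueError(f"{name} does not look like PEM")
            with open(os.path.join(tmp, name), "w") as f:
                f.write(pem)
        backup = None
        if os.path.isdir(dest_dir):            # keep old creds for rollback
            backup = dest_dir + ".bak"
            shutil.rmtree(backup, ignore_errors=True)
            os.rename(dest_dir, backup)
        try:
            os.rename(tmp, dest_dir)           # atomic on the same filesystem
        except OSError:
            if backup:                         # roll the old directory back
                os.rename(backup, dest_dir)
            raise
        if backup:
            shutil.rmtree(backup, ignore_errors=True)
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
```

A failed validation leaves the existing `mtls/` directory untouched, which is the property the CLI relies on: a botched bootstrap never destroys working credentials.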
PKI provisioning occurs during deploy_cluster_with_logs() (crates/openshell-bootstrap/src/lib.rs:284). The full sequence:
- Cluster container launched -- a k3s container is created via Docker with a persistent volume.
- k3s readiness -- the bootstrap waits for k3s to become ready inside the container.
- Extra SANs computed -- for remote deployments, the SSH destination hostname and its resolved IP are added to the server certificate's SANs. For local deployments, the detected gateway host (if any) is added.
- `reconcile_pki()` called (crates/openshell-bootstrap/src/lib.rs:515):
  - Wait for the `openshell` namespace to exist (created by the Helm controller).
  - Attempt to load existing PKI from the three K8s secrets via `kubectl get secret` exec'd inside the container. Each field is base64-decoded and validated for PEM markers.
  - If secrets exist and are valid: reuse them and return `rotated=false`.
  - If secrets are missing, incomplete, or malformed: generate fresh PKI via `generate_pki()`, apply all three secrets via `kubectl apply`, and return `rotated=true`.
- Workload restart on rotation -- if `rotated=true` and the openshell StatefulSet already exists, the bootstrap performs `kubectl rollout restart` and waits for completion. This ensures the server picks up new TLS secrets before the CLI writes its local copy.
- CLI-side credential storage -- `store_pki_bundle()` writes `ca.crt`, `tls.crt`, `tls.key` to the local filesystem.
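The reuse-or-regenerate decision amounts to base64-decoding each secret field and checking for PEM markers. A minimal sketch, with field names taken from the secrets table above; the exact validation code in the bootstrap crate may differ:

```python
import base64

# Fields each secret must carry (mirrors the secrets table in this doc).
REQUIRED_FIELDS = {
    "openshell-server-tls": ["tls.crt", "tls.key"],
    "openshell-server-client-ca": ["ca.crt"],
    "openshell-client-tls": ["tls.crt", "tls.key", "ca.crt"],
}

def looks_like_pem(data):
    return b"-----BEGIN" in data and b"-----END" in data

def can_reuse_pki(secrets):
    """True iff all three secrets exist with valid-looking PEM fields."""
    for name, fields in REQUIRED_FIELDS.items():
        secret = secrets.get(name)
        if secret is None:
            return False                      # secret missing -> regenerate
        for field in fields:
            b64 = secret.get(field)
            if b64 is None:
                return False                  # field missing -> regenerate
            try:
                decoded = base64.b64decode(b64, validate=True)
            except Exception:
                return False                  # malformed base64 -> regenerate
            if not looks_like_pem(decoded):
                return False                  # no PEM markers -> regenerate
    return True
```

Any single failure regenerates the whole bundle; there is no partial repair, which matches the "rotation is coarse-grained" property noted later in the residual-risk table.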
sequenceDiagram
participant CLI as nav deploy
participant Docker as Cluster Container
participant K8s as k3s / K8s API
CLI->>Docker: Create container, wait for k3s
CLI->>K8s: Wait for openshell namespace
CLI->>K8s: Read existing TLS secrets
alt Secrets valid
CLI->>CLI: Reuse existing PKI
else Secrets missing/invalid
CLI->>CLI: generate_pki(extra_sans)
CLI->>K8s: kubectl apply (3 secrets)
alt Workload exists
CLI->>K8s: kubectl rollout restart
CLI->>K8s: Wait for rollout complete
end
end
CLI->>CLI: store_pki_bundle() to local filesystem
The gateway supports three transport modes:
- mTLS (default) -- TLS is enabled and client certificates are required.
- Dual-auth TLS -- TLS is enabled, but the handshake also accepts clients without certificates (`allow_unauthenticated=true`). This is used for Cloudflare Tunnel deployments where the edge authenticates the user and forwards a Cloudflare JWT to the gateway.
- Plaintext behind edge -- TLS is disabled at the gateway and the service listens on HTTP behind a trusted reverse proxy or tunnel.
TlsAcceptor::from_files() (crates/openshell-server/src/tls.rs:27) constructs the rustls::ServerConfig:
- Server identity: loads the server certificate and private key from PEM files (supports PKCS#1, PKCS#8, and SEC1 key formats).
- Client verification: builds a `WebPkiClientVerifier` from the CA certificate. In the default mode it requires a valid client certificate; in dual-auth mode it also accepts no-certificate clients and defers authentication to the HTTP/gRPC layer.
- ALPN: advertises `h2` and `http/1.1` for protocol negotiation.
TCP accept
→ TLS handshake (mandatory client cert in mTLS mode, optional in dual-auth mode)
→ hyper auto-negotiates HTTP/1.1 or HTTP/2 via ALPN
→ MultiplexedService routes by content-type:
├── application/grpc → GrpcRouter
└── other → Axum HTTP Router
All traffic shares a single port. When TLS is enabled, the TLS handshake occurs before any HTTP parsing. In plaintext mode, the gateway expects an upstream reverse proxy or tunnel to be the outer security boundary.
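The routing step can be illustrated with a trivial dispatcher. This is a Python sketch only; the real MultiplexedService is a Rust service, and the exact content-type matching rule here is an assumption based on the gRPC convention.

```python
def route(content_type):
    """Route a request by content-type, mirroring the flow above.

    gRPC requests carry content-type "application/grpc" (possibly with a
    suffix such as "application/grpc+proto"); everything else falls
    through to the Axum HTTP router.
    """
    if content_type and content_type.startswith("application/grpc"):
        return "grpc"
    return "http"
```

Because routing happens after the TLS handshake and ALPN negotiation, the same mTLS guarantees apply to both branches.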
Cloudflare-fronted gateways add two HTTP endpoints on the same multiplexed port:
- `/auth/connect` -- browser login relay that reads the `CF_Authorization` cookie server-side and POSTs the token back to the CLI's localhost callback server.
- `/_ws_tunnel` -- WebSocket upgrade endpoint used to carry gRPC and SSH bytes through Cloudflare Access.
The WebSocket tunnel bridges directly into the gateway's MultiplexedService over an in-memory duplex stream. It does not re-enter the public listener, so it behaves the same whether the public listener is plaintext or TLS-backed.
The e2e test suite (e2e/python/test_security_tls.py) validates four scenarios:
| Scenario | Result |
|---|---|
| Client presents correct mTLS cert | HEALTHY response |
| Client trusts CA but presents no client cert | UNAVAILABLE -- handshake terminated |
| Client presents cert signed by a different CA | UNAVAILABLE -- handshake terminated |
| Client connects with plaintext (no TLS) | UNAVAILABLE -- transport failure |
Sandbox pods connect back to the gateway at startup to fetch their policy and provider credentials. The gRPC client (crates/openshell-sandbox/src/grpc_client.rs:18) reads three environment variables to configure mTLS:
| Env Var | Value |
|---|---|
| `OPENSHELL_TLS_CA` | `/etc/openshell-tls/client/ca.crt` |
| `OPENSHELL_TLS_CERT` | `/etc/openshell-tls/client/tls.crt` |
| `OPENSHELL_TLS_KEY` | `/etc/openshell-tls/client/tls.key` |
These are used to build a tonic::transport::ClientTlsConfig with:
- `ca_certificate()` -- verifies the server's certificate against the cluster CA.
- `identity()` -- presents the shared client certificate for mTLS.
The sandbox calls two RPCs over this authenticated channel:
- `GetSandboxSettings` -- fetches the YAML policy that governs the sandbox's behavior.
- `GetSandboxProviderEnvironment` -- fetches provider credentials as environment variables.
SSH connections into sandboxes pass through the gateway's HTTP CONNECT tunnel at /connect/ssh. This adds a second authentication layer on top of mTLS.
| Header | Purpose |
|---|---|
| `x-sandbox-id` | Identifies the target sandbox |
| `x-sandbox-token` | Session token (created via `CreateSshSession` RPC) |
The gateway validates the token against the stored SshSession record and checks:
- The token has not been revoked.
- The `sandbox_id` matches the request header.
- The token has not expired (`expires_at_ms` check; 0 means no expiry for backward compatibility).
SSH session tokens have a configurable TTL (ssh_session_ttl_secs, default 24 hours). The expires_at_ms field is set at creation time and checked on every tunnel request. Setting the TTL to 0 disables expiry.
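The expiry rule can be stated compactly. A sketch with hypothetical function names; only the constants and the "0 means no expiry" rule come from the text above.

```python
DEFAULT_TTL_SECS = 24 * 60 * 60  # ssh_session_ttl_secs default: 24 hours

def expires_at_ms(created_at_ms, ttl_secs):
    """Compute the token's expiry timestamp; TTL 0 disables expiry."""
    return 0 if ttl_secs == 0 else created_at_ms + ttl_secs * 1000

def is_expired(token_expires_at_ms, now_ms):
    """0 means 'no expiry', kept for backward compatibility with old records."""
    return token_expires_at_ms != 0 and now_ms >= token_expires_at_ms
```

Whether the boundary is inclusive or exclusive at exactly `expires_at_ms` is an implementation detail not specified here; the sketch treats the timestamp as already expired.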
Sessions are cleaned up automatically:
- On sandbox deletion: all SSH sessions for the deleted sandbox are removed from the store.
- Background reaper: a periodic task (hourly) deletes expired and revoked session records to prevent unbounded database growth.
The gateway enforces two concurrent connection limits to bound the impact of credential misuse:
| Limit | Value | Purpose |
|---|---|---|
| Per-token | 10 concurrent tunnels | Limits damage from a single leaked token |
| Per-sandbox | 20 concurrent tunnels | Prevents bypass via creating many tokens for one sandbox |
These limits are tracked in-memory and decremented when tunnels close. Exceeding either limit returns HTTP 429 (Too Many Requests).
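In-memory tracking of both limits can be sketched as two counters checked together before a tunnel is admitted. The class and method names here are hypothetical; the gateway's actual bookkeeping is in Rust.

```python
PER_TOKEN_LIMIT = 10      # concurrent tunnels per token
PER_SANDBOX_LIMIT = 20    # concurrent tunnels per sandbox

class TunnelLimiter:
    """Track concurrent tunnels per token and per sandbox (sketch).

    Both limits must pass for a tunnel to be admitted; exceeding either
    one maps to HTTP 429 at the gateway.
    """
    def __init__(self):
        self.per_token = {}
        self.per_sandbox = {}

    def try_acquire(self, token, sandbox_id):
        if self.per_token.get(token, 0) >= PER_TOKEN_LIMIT:
            return False  # -> HTTP 429
        if self.per_sandbox.get(sandbox_id, 0) >= PER_SANDBOX_LIMIT:
            return False  # -> HTTP 429
        self.per_token[token] = self.per_token.get(token, 0) + 1
        self.per_sandbox[sandbox_id] = self.per_sandbox.get(sandbox_id, 0) + 1
        return True

    def release(self, token, sandbox_id):
        # Decrement on tunnel close; drop zeroed entries to bound memory.
        for table, key in ((self.per_token, token), (self.per_sandbox, sandbox_id)):
            table[key] -= 1
            if table[key] == 0:
                del table[key]
```

The per-sandbox cap is what stops an attacker from sidestepping the per-token cap by minting many tokens for one sandbox.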
The gateway never dials the sandbox. Instead, the sandbox supervisor opens an outbound ConnectSupervisor bidirectional gRPC stream to the gateway on startup and keeps it alive for the sandbox lifetime. SSH traffic for /connect/ssh (and exec traffic for ExecSandbox) rides this same TCP+TLS+HTTP/2 connection as separate multiplexed HTTP/2 streams. The gateway-side registry and RelayStream handler live in crates/openshell-server/src/supervisor_session.rs; the supervisor-side bridge lives in crates/openshell-sandbox/src/supervisor_session.rs.
Per-connection flow:
- CLI presents `x-sandbox-id` + `x-sandbox-token` at `/connect/ssh` and passes gateway token validation.
- Gateway calls `SupervisorSessionRegistry::open_relay(sandbox_id, ...)`, which allocates a `channel_id` (UUID) and sends a `RelayOpen` message to the supervisor over the already-established `ConnectSupervisor` stream. If no session is registered yet, it polls with exponential backoff up to a bounded timeout (30 s for `/connect/ssh`, 15 s for `ExecSandbox`).
- The supervisor opens a new `RelayStream` RPC on the same `Channel` — a new HTTP/2 stream, no new TCP connection and no new TLS handshake. The first `RelayFrame` is a `RelayInit { channel_id }` that claims the pending slot on the gateway.
- `claim_relay` pairs the gateway-side waiter with the supervisor-side RPC via a `tokio::io::duplex(64 KiB)` pair. Subsequent `RelayFrame::data` frames carry raw SSH bytes in both directions. The supervisor is a dumb byte bridge: it has no protocol awareness of the SSH bytes flowing through.
- Inside the sandbox pod, the supervisor connects the relay to sshd over a Unix domain socket at `/run/openshell/ssh.sock` (see `crates/openshell-driver-kubernetes/src/main.rs`).
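The open/claim handshake can be sketched with an in-memory registry. This is a Python stand-in under stated assumptions: `socket.socketpair` plays the role of the tokio duplex pair, and the class/method names are hypothetical simplifications of the Rust registry.

```python
import socket
import time
import uuid

RELAY_PENDING_TIMEOUT = 10.0  # seconds before an unclaimed slot is reaped

class RelayRegistry:
    def __init__(self):
        self.pending = {}  # channel_id -> (created_at, supervisor-side stream)

    def open_relay(self):
        """Gateway side: allocate a channel_id and park a waiter.

        channel_id travels to the supervisor in a RelayOpen message; the
        returned stream is handed to the SSH tunnel on the gateway side.
        """
        channel_id = str(uuid.uuid4())
        gw_side, sup_side = socket.socketpair()   # in-memory duplex pair
        self.pending[channel_id] = (time.monotonic(), sup_side)
        return channel_id, gw_side

    def claim_relay(self, channel_id):
        """Supervisor side: RelayInit { channel_id } claims the slot once."""
        entry = self.pending.pop(channel_id, None)
        return None if entry is None else entry[1]

    def reap(self, now=None):
        """Sweep slots older than RELAY_PENDING_TIMEOUT (background task)."""
        now = time.monotonic() if now is None else now
        stale = [c for c, (t, _) in self.pending.items()
                 if now - t > RELAY_PENDING_TIMEOUT]
        for cid in stale:
            _, stream = self.pending.pop(cid)
            stream.close()
```

The single-use `pop` in `claim_relay` is what makes a `channel_id` unforgeable-by-replay: once claimed or reaped, a late `RelayInit` with the same id finds nothing.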
Security properties of this model:
- One auth boundary. mTLS on the `ConnectSupervisor` stream is the only identity check between gateway and sandbox. Every relay rides that same authenticated HTTP/2 connection.
- No inbound network path into the sandbox. The sandbox exposes no TCP port for gateway ingress; all relays are supervisor-initiated. The pod only needs egress to the gateway.
- In-pod access control is filesystem permissions on the Unix socket. sshd listens on `/run/openshell/ssh.sock` with the parent directory at `0700` and the socket itself at `0600`, both owned by the supervisor (root). The sandbox entrypoint runs as an unprivileged user and cannot open either. Any process in the supervisor's filesystem view that can open the socket can reach sshd — same trust model as any local Unix socket with `0600` permissions. See `crates/openshell-sandbox/src/ssh.rs:55-83`.
- Supersede race is closed. A supervisor reconnect registers a new `session_id` for the same sandbox id. Cleanup on the old session's task uses `remove_if_current(sandbox_id, session_id)` so a late-finishing old task cannot evict the new registration or serve relays meant for the new instance. See `SupervisorSessionRegistry::remove_if_current` in `crates/openshell-server/src/supervisor_session.rs`.
- Pending-relay reaper. A background task sweeps `pending_relays` entries older than 10 s (`RELAY_PENDING_TIMEOUT`). If the supervisor acknowledges `RelayOpen` but never initiates `RelayStream` — crash, deadlock, or adversarial stall — the gateway-side slot does not pin indefinitely.
- Client-side keepalives. The CLI's `ssh` invocation sets `ServerAliveInterval=15` / `ServerAliveCountMax=3` (crates/openshell-cli/src/ssh.rs:150), so a silently-dropped relay (gateway restart, supervisor restart, or adversarial TCP drop) surfaces to the user within roughly 45 s rather than hanging.
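The in-pod permission model can be checked mechanically. A sketch of the invariant the supervisor's setup implies, using the mode values from the list above; the helper names are hypothetical.

```python
import os
import stat

def modes_lock_out_non_root(parent_mode, socket_mode):
    """True iff only the owner can traverse the dir and open the socket.

    The supervisor sets the parent directory to 0700 and the socket to
    0600, so group/other must have no rwx bits on either path component.
    """
    group_other = 0o077
    return (parent_mode & group_other) == 0 and (socket_mode & group_other) == 0

def check_path(parent, sock):
    """Read the actual mode bits of both path components and verify them."""
    parent_mode = stat.S_IMODE(os.stat(parent).st_mode)
    sock_mode = stat.S_IMODE(os.stat(sock).st_mode)
    return modes_lock_out_non_root(parent_mode, sock_mode)
```

Note that both components matter: a `0600` socket inside a `0755` directory is still unreachable to others only for writing, but the `0700` directory denies even path traversal.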
Observability (sandbox side, OCSF): session_established, session_closed, session_failed, relay_open, relay_closed, relay_failed, relay_close_from_gateway — all emitted as NetworkActivity events. Gateway-side OCSF emission for the same lifecycle is a tracked follow-up.
Traffic flows through several layers from the host to the gateway process:
| Layer | Port | Configurable Via |
|---|---|---|
| Host (Docker) | 8080 (default) | `--port` flag on `nav deploy` |
| Container | 30051 | Hardcoded in `crates/openshell-bootstrap/src/docker.rs` |
| k3s NodePort | 30051 | `deploy/helm/openshell/values.yaml` (`service.nodePort`) |
| k3s Service | 8080 | `deploy/helm/openshell/values.yaml` (`service.port`) |
| Server bind | 8080 | `--port` flag / `OPENSHELL_SERVER_PORT` env var |
Docker maps host_port → 30051/tcp. Inside k3s, the NodePort service maps 30051 → 8080 (pod port). The server binds 0.0.0.0:8080.
graph LR
subgraph EXTERNAL["External"]
CLI["CLI"]
SDK["SDK"]
end
subgraph GW["Gateway (mTLS boundary)"]
TLS["TLS Termination<br/>(WebPkiClientVerifier)"]
API["gRPC + HTTP API"]
end
subgraph KUBE["Kubernetes"]
SBX["Sandbox Pod"]
end
subgraph INET["Internet"]
HOSTS["Allowed Hosts"]
end
CLI -- "mTLS<br/>(cluster CA)" --> TLS
SDK -- "mTLS<br/>(cluster CA)" --> TLS
TLS --> API
SBX -- "mTLS + ConnectSupervisor<br/>(supervisor-initiated)" --> TLS
API -- "RelayStream<br/>(HTTP/2 on same mTLS conn)" --> SBX
SBX -- "OPA policy +<br/>process identity" --> HOSTS
| Boundary | Mechanism |
|---|---|
| External → Gateway | mTLS with cluster CA by default, or trusted reverse-proxy/Cloudflare boundary in edge mode |
| Sandbox → Gateway | mTLS with shared client cert (supervisor-initiated ConnectSupervisor stream) |
| Gateway → Sandbox (SSH/exec) | Rides the supervisor's mTLS ConnectSupervisor HTTP/2 connection as a RelayStream — no separate gateway-to-pod connection |
| Supervisor → in-pod sshd | Unix-socket filesystem permissions (/run/openshell/ssh.sock, 0700 parent / 0600 socket) |
| Sandbox → External (network) | OPA policy + process identity binding via /proc |
- Individual sandbox identity at the TLS layer: all sandboxes share one client certificate (`CN=openshell-client`). Post-TLS identification uses the `x-sandbox-id` gRPC metadata header, which is trusted because it arrives over an mTLS-authenticated channel.
- Health endpoints in reverse-proxy mode: when the gateway is deployed behind Cloudflare or another trusted edge, `/health`, `/healthz`, and `/readyz` are protected by that upstream boundary rather than by direct mTLS at the gateway.
The gateway container runs with a hardened security context (deploy/helm/openshell/values.yaml:25):
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

The gateway process has no elevated privileges and drops all Linux capabilities.
This section defines the primary attacker profiles, what the current design protects, and where residual risk remains.
- Prevent unauthenticated access to gateway APIs and SSH tunneling.
- Prevent unauthorized sandbox access across tenants/sessions.
- Protect sandbox-to-gateway policy and credential exchange in transit.
- Limit impact from network-level attackers and accidental misconfiguration.
| Threat Actor | Example Capability |
|---|---|
| Network attacker | Can observe/modify traffic between clients and gateway |
| Unauthorized external client | Can reach gateway port but has no valid client cert |
| Compromised sandbox workload | Has code execution inside one sandbox pod |
| Malicious in-cluster pod | Can attempt direct pod-to-pod connections |
| Stolen CLI credentials | Has copied ca.crt/tls.crt/tls.key from a developer machine |
| Threat | Existing Defense | Notes |
|---|---|---|
| MITM or passive interception of gateway traffic | Mandatory mTLS with cluster CA, or trusted reverse-proxy boundary in Cloudflare mode | Default mode is direct mTLS; reverse-proxy mode shifts the outer trust boundary upstream |
| Unauthenticated API/health access | mTLS by default, or Cloudflare/reverse-proxy auth in edge mode | `/health*` are direct-mTLS only in the default deployment mode |
| Forged SSH tunnel connection to sandbox | Session token validation at the gateway; only the supervisor's authenticated mTLS `ConnectSupervisor` stream can carry a `RelayStream` to its sandbox | Forging a relay requires stealing a valid mTLS client identity |
| Direct access to sandbox sshd from cluster peers | sshd listens on a Unix socket (`0700` parent / `0600` socket) inside the pod | No network path exists to sshd from cluster peers |
| Stale or reconnecting supervisor serves relays for a new instance | `session_id`-scoped `remove_if_current` on the registry | Old session cleanup cannot evict a newer registration |
| Supervisor acknowledges `RelayOpen` but never initiates `RelayStream` | Gateway-side pending-relay reaper (10 s timeout) | Prevents indefinite resource pinning by a buggy or malicious supervisor |
| Silent TCP drop of an in-flight relay | CLI `ServerAliveInterval=15` / `ServerAliveCountMax=3` | Client detects a dead relay within ~45 s instead of hanging |
| Unauthorized outbound internet access from sandbox | OPA policy + process identity checks | Applies to sandbox egress policy layer |
| Risk | Why It Exists |
|---|---|
| No per-sandbox TLS identity | All sandboxes and CLI share one client certificate |
| Broad blast radius on key compromise | Shared client key reuse across multiple components |
| Weak cryptoperiod | Certificates are effectively non-expiring by default |
| Limited fine-grained revocation | CA private key is not persisted; rotation is coarse-grained |
| Local credential theft risk | CLI mTLS key material is stored on developer filesystem |
| SSH token + mTLS = persistent access within trust boundary | SSH tokens expire after 24h (configurable) and are capped at 10 concurrent connections per token / 20 per sandbox, but within the mTLS trust boundary a stolen token remains usable until TTL expires |
- A fully compromised Kubernetes control plane or cluster-admin account.
- A malicious actor with direct access to Kubernetes secrets in the `openshell` namespace.
- Host-level compromise of the developer workstation running the CLI.
- Application-layer authorization bugs after mTLS authentication succeeds.
- Application-layer authorization bugs after mTLS authentication succeeds.
- The cluster CA is generated and distributed without interception during bootstrap.
- Kubernetes secret access is restricted to intended workloads and operators.
- Gateway and sandbox container images are trusted and not tampered with.
- The sandbox pod's filesystem is trusted: only the supervisor process (root) can open `/run/openshell/ssh.sock`, which is enforced by the `0700` parent directory and `0600` socket permissions set at sshd start.
Separate from the cluster mTLS infrastructure, each sandbox has an independent TLS capability for inspecting outbound HTTPS traffic. This is documented here for completeness because it involves a distinct, per-sandbox PKI.
The sandbox proxy automatically detects and terminates TLS on outbound HTTPS connections by peeking the first bytes of each tunnel. This enables credential injection and L7 inspection without requiring explicit policy configuration. The proxy performs TLS man-in-the-middle inspection:
- Ephemeral sandbox CA: a per-sandbox CA (`CN=OpenShell Sandbox CA, O=OpenShell`) is generated at sandbox startup. This CA is completely independent of the cluster mTLS CA.
- Trust injection: the sandbox CA is written to the sandbox filesystem and injected via `NODE_EXTRA_CA_CERTS` and `SSL_CERT_FILE` so processes inside the sandbox trust it.
- Dynamic leaf certs: for each target hostname, the proxy generates and caches a leaf certificate signed by the sandbox CA (up to 256 entries).
- Upstream verification: the proxy verifies upstream server certificates against Mozilla root CAs (`webpki-roots`) and system CA certificates from the container's trust store, not against the cluster CA. Custom sandbox images can add corporate/internal CAs via `update-ca-certificates`.
This capability is orthogonal to gateway mTLS -- it operates only on sandbox-to-internet traffic and uses entirely separate key material. See Policy Language for configuration details.
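The per-hostname leaf cache can be sketched as a small bounded cache keyed by hostname. Assumptions: the eviction policy shown (LRU) and the `issue` callable are illustrative; the source only states that entries are capped at 256.

```python
from collections import OrderedDict

MAX_LEAF_CACHE = 256  # cap on dynamically issued leaf certificates

class LeafCertCache:
    """Bounded per-hostname cache of leaf certs (LRU eviction assumed)."""
    def __init__(self, issue, capacity=MAX_LEAF_CACHE):
        self.issue = issue            # callable: hostname -> cert material
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, hostname):
        if hostname in self.cache:
            self.cache.move_to_end(hostname)   # mark as recently used
            return self.cache[hostname]
        cert = self.issue(hostname)            # sign a fresh leaf with the sandbox CA
        self.cache[hostname] = cert
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)     # evict the oldest entry
        return cert
```

Caching matters here because every distinct HTTPS hostname the sandbox contacts would otherwise force a new signing operation on the hot path of the proxy.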
- Gateway Architecture -- protocol multiplexing, gRPC services, persistence, and SSH tunneling
- Cluster Bootstrap -- cluster provisioning, Docker container lifecycle, and credential management
- Sandbox Architecture -- sandbox-side isolation, proxy, and policy enforcement
- Sandbox Connect -- client-side SSH connection flow through the gateway
- Policy Language -- YAML/Rego policy system including L7 TLS inspection configuration