Description
Agent Diagnostic
- Pointed coding agent at the OpenShell repository and inspected the relevant files: `crates/openshell-sandbox/src/l7/rest.rs`, `relay.rs`, `provider.rs`, and `mod.rs`.
- Identified that `is_bodiless_response()` treats 101 Switching Protocols as a generic 1xx informational response. After forwarding the 101 headers, the relay loop returns to HTTP parsing instead of switching to raw TCP passthrough.
- Confirmed that `access: full` only expands to broad HTTP-level rules (`rule_json("*", "**")` in `expand_access_presets()`). It does not provide true raw TCP passthrough; L7 inspection continues inside the CONNECT tunnel.
- Verified with a raw Python socket test: the `HTTP/1.1 101 Switching Protocols` response arrives correctly, but all subsequent WebSocket frames are dropped (timeout). The same test run directly from the host (bypassing the proxy) receives both the 101 and the WebSocket frames immediately.
- Proxy logs confirm continued L7 inspection even after the upgrade: `[openshell_sandbox::l7::relay] HTTP_REQUEST ...`
- Wrote and tested a patch: detect 101 in `relay_response()`, add a `RelayOutcome::Upgraded` variant, and switch to `tokio::io::copy_bidirectional` after the upgrade. All 322 existing tests pass, plus a new test for the 101 behavior.
- Built the patched binary (cross-compiled for linux/arm64 via Docker), deployed it to the K3s container via hostPath, and verified that WebSocket frames now relay end-to-end through the proxy.
- Consulted Grok (xAI) for architectural guidance on the multi-hop networking and security implications throughout the investigation.
Description
Summary
When an application inside an OpenShell sandbox establishes a WebSocket connection through the egress proxy (via HTTP CONNECT tunnel), the proxy correctly relays the HTTP/1.1 101 Switching Protocols response but drops all subsequent WebSocket frames. This makes persistent bidirectional WebSocket connections impossible from inside a sandbox.
This appears to be a deeper manifestation of (or closely related to) the existing issue #409 ("openshell-sandbox egress proxy kills WebSocket connections after ~2 minutes").
Root Cause (from code inspection)
In `crates/openshell-sandbox/src/l7/rest.rs`, the `is_bodiless_response()` function returns true for all 1xx status codes (including 101):

    fn is_bodiless_response(request_method: &str, status_code: u16) -> bool {
        request_method.eq_ignore_ascii_case("HEAD")
            || (100..200).contains(&status_code) // 101 falls here
            || status_code == 204
            || status_code == 304
    }

After forwarding the 101 headers, `relay_response()` returns control to the HTTP parsing loop, which then tries to parse the next HTTP request from the client. However, the connection has been upgraded: the subsequent bytes are WebSocket frames (binary), not valid HTTP. This causes the relay to either fail silently or block waiting for more HTTP data.
Additionally, `access: full` only provides HTTP-level permissiveness. L7 inspection continues inside the CONNECT tunnel even with this policy, as confirmed by proxy logs showing `HTTP_REQUEST` after the CONNECT is allowed.
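To make the control-flow problem concrete, here is a minimal Python sketch (Python only to match the reproduction script; the real logic lives in the Rust relay). `is_bodiless_response` is an approximate transliteration of the Rust function quoted above. `next_relay_step` and its return strings are hypothetical names invented for illustration, not OpenShell APIs. The point is only that 101 has to be checked before the generic 1xx branch.

```python
def is_bodiless_response(request_method: str, status_code: int) -> bool:
    """Approximate port of the Rust function quoted above."""
    return (
        request_method.upper() == "HEAD"
        or 100 <= status_code < 200  # 101 falls here
        or status_code in (204, 304)
    )


def next_relay_step(request_method: str, status_code: int) -> str:
    """Hypothetical dispatcher: what the relay should do after forwarding headers."""
    # 101 must be intercepted first; otherwise it is treated like any other 1xx
    # and the relay goes back to parsing HTTP on an upgraded connection.
    if status_code == 101:
        return "raw_passthrough"
    if is_bodiless_response(request_method, status_code):
        return "parse_next_http_message"
    return "relay_body"
```

With this ordering, `next_relay_step("GET", 101)` selects raw passthrough, while other bodiless responses such as `100`, `204`, or any response to `HEAD` still return to HTTP parsing.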
Impact
This blocks any sandboxed application that needs persistent bidirectional WebSocket communication, including:
- OpenClaw node → remote OpenClaw gateway meshes (connect.challenge frame is dropped)
- Discord/Slack bot integrations
- Any long-lived protocol-upgraded connections
Workaround
A custom patch that detects 101 and switches to raw `tokio::io::copy_bidirectional` works reliably. We have a tested version and are happy to submit a PR.
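The passthrough idea can be sketched in Python with two threads, each copying bytes in one direction. This is not the patch itself, which is written in Rust against `tokio::io::copy_bidirectional`. `_pump` and `relay_bidirectional` are hypothetical helper names, and the sketch assumes both arguments are already-connected stream sockets.

```python
import socket
import threading


def _pump(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes from src to dst until src reaches EOF, then half-close dst."""
    try:
        while True:
            chunk = src.recv(4096)
            if not chunk:  # peer closed its sending side
                break
            dst.sendall(chunk)
    except OSError:
        pass  # treat connection errors as end of stream
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)  # propagate EOF downstream
        except OSError:
            pass


def relay_bidirectional(client: socket.socket, upstream: socket.socket):
    """Start opaque byte relaying in both directions; returns the two threads."""
    threads = [
        threading.Thread(target=_pump, args=(client, upstream), daemon=True),
        threading.Thread(target=_pump, args=(upstream, client), daemon=True),
    ]
    for t in threads:
        t.start()
    return threads
```

Once the 101 headers have been forwarded, a relay built this way no longer interprets the stream, so WebSocket frames pass through unchanged in both directions.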
Acknowledgments
This issue was diagnosed and fixed collaboratively. Claude (Anthropic) performed the deep code analysis and wrote the patch. Grok (xAI) provided architectural and security guidance. The human drove the investigation and maintained strict security requirements throughout.
Reproduction Steps
- Create an OpenShell sandbox with a network policy that allows the destination (`protocol: tcp`, `access: full`).
- Install socat inside the sandbox.
- Start a socat PROXY listener inside the sandbox:

      socat TCP-LISTEN:18789,reuseaddr,fork PROXY:10.200.0.1::,proxyport=3128 &

- Run the following raw WebSocket test from inside the sandbox:

      import socket, time

      s = socket.socket()
      s.settimeout(15)
      s.connect(('127.0.0.1', 18789))  # local socat listener from the previous step

      # Minimal WebSocket opening handshake
      req = (
          b'GET / HTTP/1.1\r\n'
          b'Host: localhost\r\n'
          b'Upgrade: websocket\r\n'
          b'Connection: Upgrade\r\n'
          b'Sec-WebSocket-Key: RylUQAh3p5cysfOlexgubw==\r\n'
          b'Sec-WebSocket-Version: 13\r\n'
          b'\r\n'
      )
      s.sendall(req)
      time.sleep(1)

      # First read: the 101 Switching Protocols response
      resp1 = s.recv(4096)
      print('RECV1:', repr(resp1[:300]))
      time.sleep(3)

      # Second read: the first WebSocket frame(s) sent by the server
      try:
          resp2 = s.recv(4096)
          print('RECV2:', repr(resp2[:200]))
      except socket.timeout:
          print('RECV2: TIMEOUT - WebSocket frames not relayed')
      s.close()

- Expected: RECV2 contains WebSocket frame data from the server.
- Actual: RECV2 times out — no WebSocket frames are relayed after the 101 response.
- Control test: Running the same test directly from the host (bypassing the sandbox proxy) receives both the 101 and the subsequent frames immediately.
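When comparing the sandbox run with the control run, it can help to confirm that RECV1 is a genuine handshake response from the WebSocket server rather than something synthesized along the way. The sketch below assumes standard RFC 6455 behavior: the server's `Sec-WebSocket-Accept` is the base64-encoded SHA-1 of the client key concatenated with a fixed GUID. `expected_accept` and `is_valid_upgrade` are hypothetical helper names and not part of the original script.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_upgrade(raw_response: bytes, key: str) -> bool:
    """Return True if raw_response is a 101 with the matching accept header."""
    head = raw_response.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ", 2)
    if len(parts) < 2 or parts[1] != "101":
        return False
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers.get("sec-websocket-accept") == expected_accept(key)
```

In the script above, this would be called as `is_valid_upgrade(resp1, "RylUQAh3p5cysfOlexgubw==")` right after the first `recv`.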
Environment
- OS: macOS 15.6.1 (Apple Silicon M4 Max)
- Docker: OrbStack (Docker Engine 28.5.2)
- OpenShell: 0.0.16
- K3s cluster image: ghcr.io/nvidia/openshell/cluster:0.0.16
- Sandbox image: ghcr.io/nvidia/openshell-community/sandboxes/openclaw:latest
- OpenClaw: 2026.3.11 (both gateway and node)
Logs
Proxy log showing continued L7 HTTP inspection inside the CONNECT tunnel (after CONNECT is allowed):

    [sandbox] [INFO] [openshell_sandbox::proxy] CONNECT action=allow ancestors=/usr/local/bin/socat1 -> /opt/openshell/bin/openshell-sandbox binary=/usr/local/bin/socat1 dst_host=100.77.240.62 dst_port=18789 engine=opa policy=openclaw_gateway proxy_addr=10.200.0.1:3128 src_addr=10.200.0.2 src_port=33246
    [sandbox] [INFO] [openshell_sandbox::l7::relay] HTTP_REQUEST credentials_injected=false host=100.77.240.62 method=GET path=/ port=18789 request_num=1

Agent-First Checklist
- I pointed my agent at the repo and had it investigate this issue
- I loaded relevant skills (e.g., `debug-openshell-cluster`, `debug-inference`, `openshell-cli`)
- My agent could not resolve this; the diagnostic above explains why