fix(bootstrap,server): persist sandbox state across gateway stop/start cycles #739

Merged
drew merged 3 commits into main from 738-sandbox-persistence-across-gateway-restart
Apr 3, 2026

drew (Collaborator) commented Apr 2, 2026

Summary

Sandbox pod data was lost whenever the gateway was stopped and restarted. Two independent bugs caused this: k3s used the container ID as its node name (which changes on container recreation, triggering PVC deletion), and sandbox pods had no persistent storage by default.

Related Issue

Fixes #738

Changes

  • Deterministic k3s node name: Added node_name() to constants.rs and pass OPENSHELL_NODE_NAME env var to the gateway container. The entrypoint script uses --node-name so the k3s node identity survives container recreation. clean_stale_nodes() now compares against the expected node name instead of running hostname inside the container.
  • Default workspace PVC: Sandbox pods now get a default 1Gi volumeClaimTemplate named "workspace" mounted at /sandbox. This ensures user files, installed packages, etc. survive pod rescheduling across gateway restarts.
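
The node-name change can be illustrated with a minimal Rust sketch. It assumes the `openshell-{name}` format given in the commit message; the signature of `node_name()` and the helper `stale_nodes()` are hypothetical and are not taken from the actual `constants.rs` or `clean_stale_nodes()` code.

```rust
/// Hypothetical sketch: derive a stable k3s node name from the gateway name.
/// The `openshell-{name}` format comes from the commit message; the
/// signature is an assumption.
fn node_name(gateway: &str) -> String {
    format!("openshell-{gateway}")
}

/// Hypothetical helper: return the registered node names that do not match
/// the expected deterministic name. A real cleanup would then remove these
/// nodes from the cluster.
fn stale_nodes<'a>(registered: &[&'a str], expected: &str) -> Vec<&'a str> {
    registered
        .iter()
        .copied()
        .filter(|name| *name != expected)
        .collect()
}
```

Because the expected name depends only on the gateway name, recreating the container no longer makes the existing node look stale, whereas a leftover node named after an old container ID still does.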

Testing

  • mise run pre-commit passes
  • Unit tests added/updated (2 new tests for workspace mount injection logic)
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

drew requested a review from a team as a code owner April 2, 2026 19:21
drew self-assigned this Apr 2, 2026
drew added the test:e2e (Requires end-to-end coverage) label Apr 2, 2026
pimlock previously approved these changes Apr 2, 2026
drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch 6 times, most recently from 871adc9 to 6da0e77, April 3, 2026 04:37
…t cycles

Two changes to preserve sandbox state across gateway restarts:

1. Deterministic k3s node identity: Set the Docker container hostname to
   a deterministic name derived from the gateway name (openshell-{name}).
   Pass OPENSHELL_NODE_NAME env var and --node-name flag to k3s via the
   cluster entrypoint as belt-and-suspenders.  Update clean_stale_nodes()
   to prefer the deterministic name with a fallback to the container
   hostname for backward compatibility with older cluster images.

   This prevents clean_stale_nodes() from deleting PVCs (including the
   server's SQLite database) when the container is recreated after an
   image upgrade.

2. Default workspace persistence: Inject a 2Gi PVC and init container
   into every sandbox pod so the /sandbox directory survives pod
   rescheduling.  The init container uses the same sandbox image, mounts
   the PVC at a temporary path, and copies the image's /sandbox contents
   (Python venv, dotfiles, skills) into the PVC on first use — guarded
   by a sentinel file so subsequent restarts are instant.  The agent
   container then mounts the populated PVC at /sandbox.  Users who
   supply custom volumeClaimTemplates are unaffected — the default
   workspace is skipped.
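
The sentinel-guarded seeding in item 2 can be sketched in Rust as follows. This is an illustration of the technique only: the PR does not say how the init container performs the copy, and the sentinel file name and both function names are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical marker file name; the PR does not name the sentinel.
const SENTINEL: &str = ".workspace-initialized";

/// Recursively copy `src` into `dst`, creating directories as needed.
/// Symlinks are followed rather than preserved in this sketch.
fn copy_tree(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_tree(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}

/// Seed the volume from the image contents once.
/// Returns Ok(true) if a copy was performed, Ok(false) if the sentinel
/// was already present and the copy was skipped.
fn seed_workspace(image_dir: &Path, volume_dir: &Path) -> io::Result<bool> {
    let sentinel = volume_dir.join(SENTINEL);
    if sentinel.exists() {
        return Ok(false);
    }
    copy_tree(image_dir, volume_dir)?;
    // Write the sentinel last so an interrupted copy is retried next time.
    fs::write(&sentinel, b"")?;
    Ok(true)
}
```

Writing the sentinel only after the copy succeeds keeps a half-populated volume from being mistaken for an initialized one.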

Fixes #738
drew force-pushed the 738-sandbox-persistence-across-gateway-restart branch from 6da0e77 to 38892c3, April 3, 2026 04:40
drew added 2 commits April 2, 2026 22:24
On resume, skip the image pull only when the container still exists.
When only the volume survives (container was removed), the image pull
must proceed so the container can be recreated. This fixes the
'container removal resumes' e2e scenario where the image was not
available after the container was force-removed.
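
The resume rule described in this commit message reduces to a small predicate. The sketch below is an assumption about its shape; the function name and parameters are hypothetical and do not come from the actual bootstrap code.

```rust
/// Hypothetical sketch of the resume rule: the image pull may be skipped
/// only when resuming a gateway whose container still exists. If only the
/// volume survived, the container must be recreated, so the pull proceeds.
fn should_pull_image(resuming: bool, container_exists: bool) -> bool {
    !(resuming && container_exists)
}
```
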
drew merged commit 491c5d8 into main Apr 3, 2026
15 of 16 checks passed
drew deleted the 738-sandbox-persistence-across-gateway-restart branch April 3, 2026 06:15
Labels

test:e2e Requires end-to-end coverage

Development

Successfully merging this pull request may close these issues.

fix: sandbox pod state lost across gateway stop/start cycles

2 participants