Guard WebSocket lifecycle with connection generations#739

Open
christineyan4 wants to merge 4 commits into
openclaw:mainfrom
christineyan4:fix-696-websocket-generation-guard

@christineyan4 christineyan4 commented Jun 10, 2026

Contributor

Fixes #696.

Summary

  • add a per-connection generation guard so stale connect/listen work cannot act on newer WebSocket connections
  • make listen loops operate on their captured socket and skip stale message, disconnect, error, and reconnect handling
  • add a loopback WebSocket regression test for overlapping connects
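
A generation guard of the kind described above might look roughly like the following. This is a minimal sketch, not the code in this PR: the `ConnectionGeneration` type and its `Begin` and `IsCurrent` members are hypothetical names chosen for illustration.

```csharp
using System.Threading;

// Hypothetical helper; the name and members are assumptions, not the PR's API.
public sealed class ConnectionGeneration
{
    private long _current;

    // Called at the start of each connect attempt. Returns the token that the
    // attempt (and any listen loop it starts) should capture.
    public long Begin() => Interlocked.Increment(ref _current);

    // True only while no newer connect attempt has started.
    public bool IsCurrent(long generation) =>
        Interlocked.Read(ref _current) == generation;
}
```

A listen loop that captured its token at connect time would check `IsCurrent(token)` before handling a message, disconnect, error, or reconnect, and return quietly when the check fails.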

Validation

  • .\build.ps1
  • dotnet test .\tests\OpenClaw.Shared.Tests\OpenClaw.Shared.Tests.csproj --no-restore
  • dotnet test .\tests\OpenClaw.Tray.Tests\OpenClaw.Tray.Tests.csproj --no-restore
  • dotnet test .\tests\OpenClaw.Connection.Tests\OpenClaw.Connection.Tests.csproj --no-restore
  • dotnet test .\tests\OpenClaw.Shared.Tests\OpenClaw.Shared.Tests.csproj --no-restore --filter "FullyQualifiedName~WebSocketClientBaseTests"
  • git diff --check

Review

  • Ran Hanselman-style dual-model adversarial review and addressed the actionable findings: stale text message processing and test flake risk.

Proof

WebSocketClientBase tests passing

christineyan4 and others added 2 commits June 10, 2026 12:27
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

clawsweeper Bot commented Jun 10, 2026

Codex review: needs changes before merge. Reviewed June 10, 2026, 3:01 PM ET / 19:01 UTC.

Summary
The PR adds connection-generation guards to shared WebSocket connect/listen handling and a loopback regression test for overlapping connects.

Reproducibility: yes. Let generation N enter reconnect backoff, establish generation N+1, then let N resume; the old loop disposes the shared N+1 socket before reconnecting.

Review metrics: 2 noteworthy metrics.

  • Guard coverage: connect/listen guarded; reconnect unguarded. The remaining shared-socket mutation occurs after the new checks and can recreate the availability failure being fixed.
  • Patch surface: 2 files, 255 additions, 18 deletions. A focused production change is accompanied by a substantial custom loopback test fixture.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🐚 platinum hermit
Patch quality: 🦪 silver shellfish
Result: blocked by patch quality or review findings.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Make reconnect backoff generation-aware and add the takeover regression test.

Risk before merge

  • [P1] A delayed reconnect from an older generation can close a healthy newer operator or node socket and restart the user-visible reconnect cycle.

Maintainer options:

  1. Scope reconnects to generations (recommended)
    Carry the initiating generation through backoff and exit before disposing or replacing the current socket when a newer generation has taken ownership.
  2. Pause the partial fix
    Do not merge while delayed reconnect work from a stale generation can still tear down the active connection.
Recommended automerge instruction:
@clawsweeper automerge

Special instructions:
Bind reconnect attempts to the initiating connection generation, abort stale backoff loops before they dispose or replace the current socket, and add a focused regression test where a newer connection succeeds while an older generation is waiting to reconnect.
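
As a rough illustration of the instruction above, the sketch below binds a delayed reconnect to the generation that requested it and rechecks ownership after the delay, before any shared state is touched. It is not the code in WebSocketClientBase.cs: `GenerationBoundReconnector`, `Install`, `ReconnectAfterDelayAsync`, and `FakeSocket` are hypothetical names, and the synchronous `connect` delegate stands in for a real asynchronous connect.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a socket; records whether it was disposed.
public sealed class FakeSocket : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

// Hypothetical sketch; the names do not come from the PR.
public sealed class GenerationBoundReconnector
{
    private readonly object _gate = new object();
    private long _generation;
    private IDisposable? _socket;

    // A successful connect installs its socket and becomes the current generation.
    public long Install(IDisposable socket)
    {
        lock (_gate)
        {
            _socket?.Dispose();
            _socket = socket;
            return ++_generation;
        }
    }

    // Waits out the backoff, then reconnects only if the initiating generation
    // still owns the socket. Returns false without touching shared state otherwise.
    public async Task<bool> ReconnectAfterDelayAsync(
        long initiatingGeneration,
        TimeSpan delay,
        Func<IDisposable> connect,
        CancellationToken cancellationToken = default)
    {
        await Task.Delay(delay, cancellationToken).ConfigureAwait(false);

        lock (_gate)
        {
            if (_generation != initiatingGeneration)
            {
                return false; // superseded during backoff: leave the newer socket alone
            }

            _socket?.Dispose();
            _socket = connect();
            _generation++;
            return true;
        }
    }
}
```

With this shape, a loop that started under generation N gets `false` back if generation N+1 was installed while it waited, and the stale loop never disposes the N+1 socket.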

Next step before merge

  • [P2] The blocking race has a narrow, mechanical code-and-test repair path on this PR.

Security
Cleared: The two-file patch has no concrete credential, authorization, dependency, workflow, or supply-chain concern.

Review findings

  • [P1] Bind the reconnect loop to the initiating generation — src/OpenClaw.Shared/WebSocketClientBase.cs:306-312
Review details

Best possible solution:

Carry socket and generation ownership through reconnect backoff, abort superseded loops before any shared-state mutation, and cover takeover during the delay with a deterministic regression test.

Do we have a high-confidence way to reproduce the issue?

Yes. Let generation N enter reconnect backoff, establish generation N+1, then let N resume; the old loop disposes the shared N+1 socket before reconnecting.

Is this the best way to solve the issue?

No, not yet. Generation guarding is the right ownership boundary, but reconnect backoff must retain and revalidate the initiating generation to close the full race.

Full review comments:

  • [P1] Bind the reconnect loop to the initiating generation — src/OpenClaw.Shared/WebSocketClientBase.cs:306-312
    The listener checks generation only before awaiting ReconnectWithBackoffAsync. If a newer ConnectAsync succeeds during the delay, the old loop resumes and disposes that newer _webSocket at lines 348-350. Carry the initiating socket/generation into reconnect handling and exit before touching shared state when superseded; add a during-backoff takeover test.
    Confidence: 0.97
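
One way to make the during-backoff takeover test deterministic is to replace the timed delay with a gate the test controls. The sketch below shows only that idea under simplifying assumptions: `TakeoverScenario` and `StaleLoopYieldsAsync` are hypothetical names, a plain counter stands in for the connection generation, and no real socket or WebSocketClientBase member is involved.

```csharp
using System.Threading.Tasks;

// Hypothetical scenario; models the race with a counter instead of a socket.
public static class TakeoverScenario
{
    // Returns true if the stale loop left the newer generation untouched.
    public static async Task<bool> StaleLoopYieldsAsync()
    {
        long current = 1; // generation N owns the connection

        // Test-controlled stand-in for the backoff delay.
        var backoff = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        async Task<bool> ReconnectAsync(long initiating)
        {
            await backoff.Task;                       // "sleeping" in backoff
            if (current != initiating) return false;  // superseded: do nothing
            current++;                                // would dispose and reconnect here
            return true;
        }

        Task<bool> stale = ReconnectAsync(1); // generation N enters backoff
        current = 2;                          // generation N+1 takes over during the delay
        backoff.SetResult(true);              // let the stale loop resume

        bool mutated = await stale;
        return !mutated && current == 2;
    }
}
```

Because the test decides when the backoff ends, the takeover always happens while the stale loop is waiting, so the outcome does not depend on timing.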

Overall correctness: patch is incorrect
Overall confidence: 0.97

AGENTS.md: found and applied where relevant.

Codex review notes: reasoning high; reviewed against 5505a85da7df.

Label changes:

  • add P2: This is a bounded connection-stability fix for a confirmed self-hosted WSS workflow rather than a repository-wide outage.
  • add merge-risk: 🚨 availability: A delayed stale reconnect can dispose a healthy newer WebSocket after merge.
  • add proof: sufficient: Contributor real behavior proof is sufficient. The linked reporter states they ran the connection-generation branch against a real self-hosted WSS gateway and observed the reconnect loop disappear with a stable operator connection.
  • add rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🐚 platinum hermit and patch quality is 🦪 silver shellfish.
  • add status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action.

Label justifications:

  • P2: This is a bounded connection-stability fix for a confirmed self-hosted WSS workflow rather than a repository-wide outage.
  • merge-risk: 🚨 availability: A delayed stale reconnect can dispose a healthy newer WebSocket after merge.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🐚 platinum hermit and patch quality is 🦪 silver shellfish.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action.
  • proof: sufficient: Contributor real behavior proof is sufficient. The linked reporter states they ran the connection-generation branch against a real self-hosted WSS gateway and observed the reconnect loop disappear with a stable operator connection.
Evidence reviewed

Acceptance criteria:

  • [P1] ./build.ps1.
  • [P1] dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Tray.Tests/OpenClaw.Tray.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Connection.Tests/OpenClaw.Connection.Tests.csproj --no-restore.
  • [P1] dotnet test ./tests/OpenClaw.Shared.Tests/OpenClaw.Shared.Tests.csproj --no-restore --filter "FullyQualifiedName~WebSocketClientBaseTests".

What I checked:

Likely related people:

  • Scott Hanselman: Extracted WebSocketClientBase and has several later changes in the same file history. (role: introduced shared base and recent area contributor; confidence: high; commits: 76f7811a1448, d23f8ca50013; files: src/OpenClaw.Shared/WebSocketClientBase.cs, tests/OpenClaw.Shared.Tests/WebSocketClientBaseTests.cs)
  • Sytone: Commit 1d83639 introduced reconnect hardening and adjacent gateway lifecycle behavior. (role: reconnect hardening contributor; confidence: high; commits: 1d836390e9ee; files: src/OpenClaw.Shared/WebSocketClientBase.cs)
  • ranjeshj: Commit 9d304c1 changed authentication-wait recovery along the same reconnect path shortly before this report. (role: recent connection recovery contributor; confidence: medium; commits: 9d304c110c0f; files: src/OpenClaw.Shared/WebSocketClientBase.cs)
  • christineyan4: Prior merged current-main work touched the shared WebSocket base, providing relevant area history beyond authorship of this PR. (role: recent current-main area contributor; confidence: medium; commits: 85445c78066b; files: src/OpenClaw.Shared/WebSocketClientBase.cs)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

clawsweeper Bot added the following labels on Jun 10, 2026:
  • proof: sufficient: Contributor real behavior proof is sufficient.
  • rating: 🦪 silver shellfish: Thin PR readiness signal; proof, validation, or implementation needs work.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action.
  • P2: Normal priority bug or improvement with limited blast radius.
  • merge-risk: 🚨 availability: 🚨 Merging this PR could cause crashes, hangs, restart loops, stalls, or process outages.
christineyan4 and others added 2 commits June 10, 2026 15:35
Bind listener-initiated reconnect loops to the socket generation that requested them so a stale loop cannot dispose or replace a newer successful connection after backoff delay.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christineyan4

Contributor Author

@clawsweeper re-review

clawsweeper Bot commented Jun 10, 2026

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Development

Successfully merging this pull request may close these issues.

Reconnect / "re-approval, repair" loop connecting to a pre-existing (self-hosted) gateway over wss