test: reproducer for write tool hanging on slow LSP initialize (related to #22872) #22884
Closed
kitlangton wants to merge 3 commits into
…22872)

Adds a failing regression test that reproduces the write tool hang reported in #22872. The write tool calls lsp.touchFile + lsp.diagnostics to enrich its output; if a matching LSP server spawns but never responds to the initialize request, the tool blocks on LSPClient.create's 45s withTimeout.

The test configures a fake LSP server (hanging-lsp-server.js) that swallows every message and never replies, asserts the file is still written correctly, and checks the tool returns within 10s. On dev today the assertion fails with ~45s actual, proving the hang. The fix should make this green by bounding the diagnostic-enrichment tail.
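The fixture's contents are not shown in this thread. As a rough sketch of what "swallows every message and never replies" amounts to, assuming the standard LSP base-protocol framing (a `Content-Length` header, a blank line, then the JSON body), a server of this shape accepts input and produces nothing. `encodeLspMessage` and `SilentLspServer` are hypothetical names for illustration, not the code in hanging-lsp-server.js.

```typescript
// Hypothetical illustration; not the actual hanging-lsp-server.js fixture.

// Frame a JSON-RPC payload the way the LSP base protocol expects:
// a Content-Length header (byte length of the body), a blank line, then the body.
function encodeLspMessage(payload: object): string {
  const body = JSON.stringify(payload);
  return `Content-Length: ${Buffer.byteLength(body, "utf8")}\r\n\r\n${body}`;
}

// A server that consumes whatever it is sent and never produces a reply.
class SilentLspServer {
  bytesReceived = 0;
  readonly replies: string[] = [];

  // Record that input arrived, then do nothing: no parsing, no response.
  receive(chunk: string): void {
    this.bytesReceived += Buffer.byteLength(chunk, "utf8");
  }
}

const silentServer = new SilentLspServer();
silentServer.receive(
  encodeLspMessage({ jsonrpc: "2.0", id: 1, method: "initialize", params: {} }),
);
// The client is still waiting at this point: silentServer.replies stays empty.
```

From the client's side, the process is alive and its stdin is being drained, so nothing looks wrong until the `initialize` timeout fires.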
Adds a second reproducer covering the 'forever' branch of issue #22872: when Pyright.spawn calls Npm.which('pyright') and the npm registry is unreachable (sandboxed container), arborist.reify blocks indefinitely with no timeout.

Changes:
- Adds optional Info.spawnEffect alongside the existing async Info.spawn. spawnEffect returns an Effect that can yield from Npm.Service, making npm lookups injectable for tests.
- Migrates Pyright to use spawnEffect, pulling the venv probing logic into a reusable pyrightVenvInitialization helper. The legacy async spawn stays for backwards compatibility.
- Threads Npm.Service through LSP.layer so getClients captures a stable reference and uses it for any server that provides spawnEffect.
- Adds test/tool/write-lsp-spawn-hang.test.ts, which mocks Npm.Service.which with Effect.never and asserts the write tool still returns in < 10s. Fails today (hangs forever); the fix must bound the touchFile tail so the tool cannot wait on a wedged LSP spawn.

The two reproducers now cover both hang branches:
- write-lsp-hang.test.ts: 45s LSPClient.create initialize timeout
- write-lsp-spawn-hang.test.ts: unbounded Npm.which arborist.reify
Issue #22872 reports the write tool hanging indefinitely after a file is written. Two underlying causes, both in the post-write LSP enrichment path:

1. LSPClient.create wraps the `initialize` request in a 45s withTimeout. If the spawned LSP process is wedged (happens with pyright under certain conditions), every write that matches that LSP blocks the tool for up to 45s even though the file is on disk.
2. Server.spawn for npm-distributed LSPs (pyright, tsserver, biome, ...) calls Npm.which, which internally uses arborist.reify with no timeout. In sandboxed containers with no network access this promise never resolves, so the write tool hangs forever.

Fix applied at three layers of defense:
- write.ts / edit.ts / apply_patch.ts: wrap the touchFile + diagnostics tail in a 5s Effect.timeout with catch-to-empty. Diagnostics are a best-effort enrichment; they must not block the tool's return after the file is already written.
- lsp.ts schedule(): bound server.spawn with a 10s Promise.race timeout. On timeout the server is added to s.broken so subsequent touches short-circuit instantly instead of re-racing.
- client.ts: lower the `initialize` withTimeout from 45_000 to 10_000. If a server hasn't responded to initialize in 10s it's wedged; 45s was punishing for no benefit.

Reproducer tests (added in earlier commits on this branch) now pass:
- write-lsp-hang.test.ts (branch A, 45s initialize timeout)
- write-lsp-spawn-hang.test.ts (branch B, forever Npm.which)

Both complete in ~5s. Full opencode test suite: 1934 pass, 0 fail.
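The first layer (a short timeout with catch-to-empty around the enrichment tail) can be illustrated without Effect. The commit uses `Effect.timeout`; the sketch below swaps in plain promises and `Promise.race` purely to show the shape. `withDeadline`, `writeThenEnrich`, and the `Diagnostic` type are hypothetical names, not code from write.ts.

```typescript
// Hypothetical sketch using plain promises; the commit itself uses Effect.timeout.

type Diagnostic = { file: string; message: string };

// Settle with `fallback` if `work` rejects or has not settled within `ms`.
async function withDeadline<T>(work: Promise<T>, ms: number, fallback: T): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<T>((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  try {
    return await Promise.race([work.catch(() => fallback), deadline]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}

// The write has already happened by the time enrichment starts, so a slow or
// failing enrichment only costs the diagnostics, never the tool result.
async function writeThenEnrich(
  writeFile: () => Promise<void>,
  enrich: () => Promise<Diagnostic[]>,
  enrichBudgetMs: number,
): Promise<{ output: string; diagnostics: Diagnostic[] }> {
  await writeFile();
  const diagnostics = await withDeadline(enrich(), enrichBudgetMs, []);
  return { output: "Wrote file successfully", diagnostics };
}
```

With this shape the worst-case delay after the file is on disk is the enrichment budget, regardless of whether the LSP is slow to initialize or never spawns.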
Summary
Adds a failing regression test for a related but not identical hang to the one reported in #22872. The reporter's exact scenario almost certainly isn't LSP (see below), but while investigating that issue I surfaced a real, reproducible hang in the same tool (`write`) on the same enrichment code path they correctly pointed at. This PR pins that behavior down with a red test; a follow-up will turn it green.

What the original report said
From #22872 (reporter: DLME2024):

- "`write` tool in OpenCode 1.4.6 hangs indefinitely on any content size"
- `node:20-slim`, Anthropic Sonnet 4.6, no LSP configured, pyright not installed
- `POST /session/$SES/message` asking the model to write `/tmp/hello.py`
- `tool=write status=running`, no output, no `time.end`, file is not on disk
- (… `Tool.defineEffect` conversion that introduced `lsp.touchFile` / `lsp.diagnostics`)

"Indefinitely" is the reporter's framing; they waited 60s. The LSP `initialize` timeout is 45s, so a pure LSP hang would have completed (with an error) before their patience ran out. The decisive detail is "file is not on disk": the write tool calls `fs.writeWithDirs` at `write.ts:57`, before any LSP enrichment at `write.ts:67-68`. If the file was never written, execution stalled earlier, almost certainly at `assertExternalDirectoryEffect` (`write.ts:40`), which fires a `ctx.ask({ permission: "external_directory" })` for any path outside the project. In a headless container with no UI/TUI to answer the prompt, that Deferred sits forever. `/tmp/hello.py` is outside the project, so this fits cleanly.

So: the reporter's specific hang is most likely the permission-ask-with-no-listener issue, not LSP.
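The ordering argument can be made concrete with a toy model. This is only a sketch of the step order described above (permission ask, then file write, then LSP enrichment); `createPendingAsk` and `runWriteSteps` are hypothetical names and do not mirror the real `ctx.ask` or Deferred implementation.

```typescript
// Hypothetical model of the step order: ask, then write, then enrich.

// A promise that only settles when someone calls `answer`,
// like a permission prompt waiting for a UI that may not exist.
function createPendingAsk(): { answered: Promise<void>; answer: () => void } {
  let answer: () => void = () => {};
  const answered = new Promise<void>((resolve) => {
    answer = resolve;
  });
  return { answered, answer };
}

async function runWriteSteps(log: string[], permission: Promise<void>): Promise<void> {
  log.push("ask external_directory permission");
  await permission; // stalls here for as long as nobody answers
  log.push("write file");
  log.push("LSP enrichment");
}
```

If `answer` is never called, the log stops after the first entry: the file write is never reached, which matches "file is not on disk" and cannot be produced by a stall in the enrichment step.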
The issue this PR does reproduce
Even so, while investigating I confirmed with runtime instrumentation (motel traces) that the write tool has a separate, demonstrable hang on LSP enrichment for files that match a configured LSP. Writing a `.py` file inside a project with pyright available blocks for 45 seconds on the `initialize` request if pyright spawns but doesn't respond. Walking through the chain:

1. `lsp.touchFile(filepath, true)` → `getClients(filepath)` (`packages/opencode/src/lsp/lsp.ts:225`): walks every registered LSP server, filters by extension, and for each match looks up (or lazily provisions) a client rooted at the nearest config dir.
2. Lazy provisioning calls `server.spawn(root)` (`lsp.ts:232-243`). For pyright (`server.ts:484-526`) this path does:
   - `which("pyright-langserver")`: if missing, falls through to…
   - `Npm.which("pyright")` → `arborist.reify()`: installs pyright into the opencode cache. No timeout. In a container with restricted network this step can block indefinitely on its own.
3. Spawn succeeds → `LSPClient.create({serverID, server: handle, root})` (`lsp.ts:248-257`): sends an LSP `initialize` request wrapped in `withTimeout(45_000)` (`client.ts:82-116`). If the server process spawns but never answers (which I reproduced locally with a fresh pyright), the request sits for the full 45 seconds.
4. `touchFile` is awaited synchronously by the write tool (`write.ts:67`). Even though `fs.writeWithDirs` completed at `write.ts:57`, the tool's `Effect.gen` can't return its success result until LSP enrichment resolves. The user sees `tool=write status=running` with no output, but in this path the file is already on disk.

So there are two stacked issues on the same enrichment step:

A. `LSPClient.create`'s `initialize` timeout is 45s, which is a long time to block a tool on a cosmetic step.
B. `Npm.which` → `arborist.reify` has no timeout, so constrained-network environments can wait forever.

These are strictly a superset of the "works but slow" symptom: on dev machines you hit A; in a container like the reporter's you'd hit B if LSP enrichment is actually the blocker. They don't explain the reporter's "file not on disk" data point, though, which still points at the permission ask.
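The effect of issue A on the tool's return time can be sketched with scaled-down numbers. `rejectAfter` below is a hypothetical stand-in for `withTimeout` (whose implementation is not shown here), and `simulateWrite` is an illustrative model of the awaited enrichment step, not the code in write.ts.

```typescript
// Hypothetical model of the blocking path; use a small timeout in place of 45s.

// Reject if `work` has not settled within `ms` (stand-in for the real withTimeout).
async function rejectAfter<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Records when each step finishes, in milliseconds since the call started.
async function simulateWrite(
  initializeTimeoutMs: number,
): Promise<{ fileWrittenMs: number; toolReturnedMs: number }> {
  const start = Date.now();

  // Step 1: the file write completes almost immediately.
  await Promise.resolve();
  const fileWrittenMs = Date.now() - start;

  // Step 2: enrichment awaits an `initialize` reply that never arrives,
  // so it only ends when the timeout rejects.
  const neverAnswered = new Promise<void>(() => {});
  await rejectAfter(neverAnswered, initializeTimeoutMs).catch(() => undefined);

  // Step 3: only now can the tool report success.
  return { fileWrittenMs, toolReturnedMs: Date.now() - start };
}
```

Calling `simulateWrite(50)` yields a `fileWrittenMs` near zero and a `toolReturnedMs` close to the timeout, which is the same pattern as the real case with 45s in place of 50ms.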
Motel trace confirming the location of the LSP hang
From a local repro with OTLP export on, debug session `write-hang-22872`, `.py` write inside a project:

- `write.execute entry` → `19:37:10.949Z`
- `write.assertExternalDirectory done` → `19:37:10.950Z` (1ms; not the hang in this case, since the path was inside the project)
- `write.touchFile begin` → `19:37:11.192Z`
- `LSPClient.create done` → `elapsedMs=45015 hasClient=false` + `ERROR ... Operation timed out after 45000ms initialize error`
- `write.touchFile done` → `19:37:56.201Z`

Entire 45s accounted for inside `touchFile` → `LSPClient.create`, with the file already written 45 seconds earlier.

What this test does
- `packages/opencode/test/fixture/lsp/hanging-lsp-server.js`: a fake LSP that swallows every message (including `initialize`) and never replies.
- `packages/opencode/test/tool/write-lsp-hang.test.ts`: wires that fake LSP into a tmpdir instance via `opencode.json`'s `lsp` config for a new `.hang` extension, calls the write tool on a `.hang` file, and asserts:
  - `result.output` contains `"Wrote file successfully"`.

On `dev` today: the file is written correctly (`expect(content).toBe("print('hi')")` passes), but the tool call takes 45s because it waits on the LSP `initialize` timeout.

Next steps (not in this PR)
Two independent fixes suggested by this investigation, which together close both A and B above:
1. Wrap `lsp.touchFile` + `lsp.diagnostics` in a short timeout at the call site in `write.ts` (and the equivalent in `edit.ts` / `apply_patch.ts`). On timeout, return the successful write with empty diagnostics. This is what makes the test in this PR go green.
2. Bound `server.spawn` + `LSPClient.create` inside `getClients`'s `schedule(...)` and add the server to `s.broken` on timeout, so every future caller of `touchFile` benefits, not just write.

The original reporter's symptom (file never on disk) is a separate bug in the external-directory permission ask. It is worth a distinct issue and fix, since the LSP timeout above won't help if execution stalls before the file write.
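The second fix (bounding the spawn and remembering wedged servers) might look roughly like the following. This is a sketch under assumptions: `brokenServers` stands in for `s.broken`, `spawnBounded` is a hypothetical helper rather than the real `schedule(...)`, and treating a spawn failure the same as a timeout is a choice made here for brevity.

```typescript
// Hypothetical sketch; `spawnBounded` and `brokenServers` are illustrative names.

const brokenServers = new Set<string>();

// Try to spawn a server within `ms`. On timeout (or, in this sketch, on failure)
// remember the server as broken so later callers skip it without waiting again.
async function spawnBounded<H>(
  serverId: string,
  spawn: () => Promise<H>,
  ms: number,
): Promise<H | undefined> {
  // Short-circuit: a server that already timed out is not retried.
  if (brokenServers.has(serverId)) return undefined;

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => resolve(undefined), ms);
  });
  try {
    const handle = await Promise.race([spawn().catch(() => undefined), timedOut]);
    if (handle === undefined) brokenServers.add(serverId);
    return handle;
  } finally {
    clearTimeout(timer);
  }
}
```

Only the first caller pays the timeout; every later `touchFile` for the same server returns immediately, which is what lets tools other than write benefit.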