test: reproducer for write tool hanging on slow LSP initialize (related to #22872) #22884

Closed
kitlangton wants to merge 3 commits into dev from kit/repro-write-lsp-hang

@kitlangton kitlangton commented Apr 16, 2026

Summary

Adds a failing regression test for a hang that is related to, but not identical to, the one reported in #22872. The reporter's exact scenario almost certainly isn't LSP (see below), but while investigating that issue I surfaced a real, reproducible hang in the same tool (write) on the same enrichment code path they correctly pointed at. This PR pins that behavior down with a red test; a follow-up will turn it green.

What the original report said

From #22872 (reporter: DLME2024):

  • "write tool in OpenCode 1.4.6 hangs indefinitely on any content size"
  • Environment: Docker node:20-slim, Anthropic Sonnet 4.6, no LSP configured, pyright not installed
  • Repro path: POST /session/$SES/message asking the model to write /tmp/hello.py
  • After 60s: tool=write status=running, no output, no time.end, file is not on disk
  • Suspected cause: #21901 ("refactor(tool): convert write tool to Tool.defineEffect"), the Tool.defineEffect conversion that introduced lsp.touchFile/lsp.diagnostics

"Indefinitely" is the reporter's framing — they waited 60s. The LSP initialize timeout is 45s, so a pure LSP hang would have completed (with an error) before their patience ran out. The decisive detail is "file is not on disk": the write tool calls fs.writeWithDirs at write.ts:57, before any LSP enrichment at write.ts:67-68. If the file was never written, execution stalled earlier — almost certainly at assertExternalDirectoryEffect (write.ts:40), which fires a ctx.ask({ permission: "external_directory" }) for any path outside the project. In a headless container with no UI/TUI to answer the prompt, that Deferred sits forever. /tmp/hello.py is outside the project, so this fits cleanly.

So: the reporter's specific hang is most likely the permission-ask-with-no-listener issue, not LSP.
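
To illustrate the permission-ask failure mode, here is a minimal sketch in plain Promises rather than Effect. The names `ask` and `settledWithin` are hypothetical and are not taken from the opencode codebase; the point is only that a prompt with no listener never settles, so nothing after it (including the file write) can run.

```typescript
// Hypothetical stand-in for a permission prompt: it resolves only when some UI answers.
type Answer = "allow" | "deny"

interface PendingAsk {
  promise: Promise<Answer>
  answer: (a: Answer) => void
}

function ask(): PendingAsk {
  let answer!: (a: Answer) => void
  const promise = new Promise<Answer>((resolve) => {
    answer = resolve
  })
  return { promise, answer }
}

// Resolves to "pending" if `p` has not settled within `ms` milliseconds.
function settledWithin<T>(p: Promise<T>, ms: number): Promise<T | "pending"> {
  return new Promise<T | "pending">((resolve, reject) => {
    const timer = setTimeout(() => resolve("pending"), ms)
    p.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      },
    )
  })
}

// Headless case: nobody ever calls `answer`, so code awaiting the ask never continues.
const headless = ask()
const headlessResult = settledWithin(headless.promise, 50)

// Interactive case: a listener answers, and execution can proceed to the write.
const interactive = ask()
interactive.answer("allow")
const interactiveResult = settledWithin(interactive.promise, 50)
```

In this model the headless ask is still pending after the probe window, while the answered one resolves to "allow". That matches the "file is not on disk" symptom: the stall happens before the write step.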

The issue this PR does reproduce

Even so, while investigating I confirmed with runtime instrumentation (motel traces) that the write tool has a separate, demonstrable hang on LSP enrichment for files that match a configured LSP. Writing a .py file inside a project with pyright available blocks for 45 seconds on the initialize request if pyright spawns but doesn't respond. Walking through the chain:

  1. lsp.touchFile(filepath, true) → getClients(filepath) (packages/opencode/src/lsp/lsp.ts:225). This walks every registered LSP server, filters by extension, and for each match looks up (or lazily provisions) a client rooted at the nearest config dir.

  2. Lazy provisioning calls server.spawn(root) (lsp.ts:232-243). For pyright (server.ts:484-526) this path does:

    • which("pyright-langserver") — if missing, falls through to…
    • Npm.which("pyright") → arborist.reify(), which installs pyright into the opencode cache. No timeout. In a container with restricted network this step can block indefinitely on its own.
  3. Spawn succeeds → LSPClient.create({serverID, server: handle, root}) (lsp.ts:248-257) — sends an LSP initialize request wrapped in withTimeout(45_000) (client.ts:82-116). If the server process spawns but never answers (which I reproduced locally with a fresh pyright), the request sits for the full 45 seconds.

  4. touchFile is awaited inline by the write tool (write.ts:67). Even though fs.writeWithDirs completed at write.ts:57, the tool's Effect.gen can't return its success result until LSP enrichment resolves. The user sees tool=write status=running with no output, but in this path the file is already on disk.

So there are two stacked issues on the same enrichment step:

  • A (always-present 45s ceiling): LSPClient.create's initialize timeout is 45s, which is a long time to block a tool on a cosmetic step.
  • B (unbounded in sandboxed containers): Npm.which → arborist.reify has no timeout, so constrained-network environments can wait forever.

Together these are a superset of the "works but slow" symptom: on dev machines you hit A; in a container like the reporter's you would hit B if LSP enrichment were actually the blocker. Neither explains the reporter's "file not on disk" data point, though, which still points at the permission ask.
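
Issue A can be sketched with a Promise-based timeout wrapper. This `withTimeout` is a hypothetical re-creation for illustration; the real helper in client.ts may be implemented differently. The timeout is scaled down from 45_000 ms so the sketch finishes quickly.

```typescript
// Hypothetical re-creation of a withTimeout helper; not the actual client.ts code.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Operation timed out after ${ms}ms`)),
      ms,
    )
    promise.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      (error) => {
        clearTimeout(timer)
        reject(error)
      },
    )
  })
}

// Issue A: the server process spawned, but `initialize` is never answered.
const initializeNeverAnswered = new Promise<never>(() => {})

// Scaled down from 45_000 ms for the purposes of this sketch.
const INITIALIZE_TIMEOUT_MS = 100

const started = Date.now()
const outcome = withTimeout(initializeNeverAnswered, INITIALIZE_TIMEOUT_MS).then(
  () => ({ ok: true, elapsedMs: Date.now() - started, message: "" }),
  (error: Error) => ({ ok: false, elapsedMs: Date.now() - started, message: error.message }),
)
```

The caller is blocked for the full timeout and then gets an error. Issue B is the same shape without the wrapper: awaiting a promise like `initializeNeverAnswered` directly never returns.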

Motel trace confirming the location of the LSP hang

From a local repro with OTLP export on, debug session write-hang-22872, .py write inside a project:

  • write.execute entry: 19:37:10.949Z
  • write.assertExternalDirectory done: 19:37:10.950Z (1ms; not the hang in this case, since the path was inside the project)
  • write.touchFile begin: 19:37:11.192Z
  • LSPClient.create done: elapsedMs=45015 hasClient=false + ERROR ... Operation timed out after 45000ms initialize error
  • write.touchFile done: 19:37:56.201Z

Entire 45s accounted for inside touchFile → LSPClient.create, with the file already written 45 seconds earlier.
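
As a quick check of the arithmetic, the span durations can be recomputed from the trace timestamps. The trace lists only times of day, so the date below is an arbitrary placeholder used solely to make the strings parseable.

```typescript
// Timestamps are copied from the trace; the date is a placeholder (an assumption).
const PLACEHOLDER_DATE = "1970-01-01"

// Milliseconds between two same-day UTC times such as "19:37:10.949Z".
function msBetween(start: string, end: string): number {
  return Date.parse(`${PLACEHOLDER_DATE}T${end}`) - Date.parse(`${PLACEHOLDER_DATE}T${start}`)
}

// write.execute entry -> write.touchFile begin
const beforeTouchFileMs = msBetween("19:37:10.949Z", "19:37:11.192Z") // 243

// write.touchFile begin -> write.touchFile done
const insideTouchFileMs = msBetween("19:37:11.192Z", "19:37:56.201Z") // 45009

// write.execute entry -> write.touchFile done
const totalMs = msBetween("19:37:10.949Z", "19:37:56.201Z") // 45252
```

Roughly a quarter of a second passes before touchFile begins, and about 45 seconds pass inside it.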

What this test does

  1. Adds packages/opencode/test/fixture/lsp/hanging-lsp-server.js — a fake LSP that swallows every message (including initialize) and never replies.
  2. Adds packages/opencode/test/tool/write-lsp-hang.test.ts — wires that fake LSP into a tmpdir instance via opencode.json's lsp config for a new .hang extension, calls the write tool on a .hang file, and asserts:
    • The file is written correctly on disk.
    • result.output contains "Wrote file successfully".
    • The whole call returns in < 10s.

On dev today:

Expected: < 10000
Received: 45279

The file is written correctly — expect(content).toBe("print('hi')") passes — but the tool call takes 45s because it waits on the LSP initialize timeout.
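
The shape of the test can be modeled in-process. This is a simplified sketch, not the contents of write-lsp-hang.test.ts: `HangingServer`, `writeTool`, and `timed` are hypothetical stand-ins, the file system is a Map, and both time bounds are scaled down (200 ms for the 45s initialize timeout, 50 ms for the 10s limit).

```typescript
// Hypothetical in-process model of a server that swallows every message.
class HangingServer {
  readonly received: string[] = []
  readonly replies: string[] = []

  // Accept a message and deliberately never produce a reply.
  send(message: string): void {
    this.received.push(message)
  }
}

// In-memory stand-in for the file system.
const files = new Map<string, string>()

// Stand-in for the write tool on dev: write first, then block on LSP initialize.
async function writeTool(
  path: string,
  content: string,
  server: HangingServer,
  initializeTimeoutMs: number,
): Promise<{ output: string }> {
  files.set(path, content)
  server.send(JSON.stringify({ jsonrpc: "2.0", id: 1, method: "initialize" }))
  // No reply ever arrives, so the tool only continues once the timeout elapses.
  await new Promise<void>((resolve) => setTimeout(resolve, initializeTimeoutMs))
  return { output: "Wrote file successfully" }
}

// Run `task` and report how long it took.
async function timed<T>(task: () => Promise<T>): Promise<{ value: T; elapsedMs: number }> {
  const start = Date.now()
  const value = await task()
  return { value, elapsedMs: Date.now() - start }
}

const INIT_TIMEOUT_MS = 200 // scaled down from 45s
const LIMIT_MS = 50 // scaled down from the 10s bound

const server = new HangingServer()
const checks = timed(() =>
  writeTool("/project/a.hang", "print('hi')", server, INIT_TIMEOUT_MS),
).then(({ value, elapsedMs }) => ({
  written: files.get("/project/a.hang") === "print('hi')",
  outputOk: value.output.includes("Wrote file successfully"),
  fastEnough: elapsedMs < LIMIT_MS,
}))
```

In this model the first two checks pass and only `fastEnough` is false, mirroring the red result on dev: the content is correct, but the call waits out the initialize timeout.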

Next steps (not in this PR)

Two independent fixes suggested by this investigation, which together close both A and B above:

  • Wrap lsp.touchFile + lsp.diagnostics in a short timeout at the call site in write.ts (and the equivalent in edit.ts / apply_patch.ts). On timeout, return the successful write with empty diagnostics. This is what makes the test in this PR go green.
  • Independently, bound server.spawn + LSPClient.create inside getClients's schedule(...) and add the server to s.broken on timeout, so every future caller of touchFile benefits — not just write.
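
Both suggestions can be sketched with plain Promises. The real code is Effect-based, so treat this as a translation under assumptions: `withFallback`, `writeThenEnrich`, `getClient`, and `hang` are hypothetical names, and the 50 ms budgets are placeholders chosen so the sketch runs quickly.

```typescript
type Diagnostic = { file: string; message: string }

// Resolve with `fallback` if `task` has not settled within `ms`; errors also fall back.
function withFallback<T>(task: Promise<T>, ms: number, fallback: T): Promise<T> {
  return new Promise<T>((resolve) => {
    const timer = setTimeout(() => resolve(fallback), ms)
    task.then(
      (value) => {
        clearTimeout(timer)
        resolve(value)
      },
      () => {
        clearTimeout(timer)
        resolve(fallback)
      },
    )
  })
}

// First suggestion: bound the enrichment tail at the call site.
async function writeThenEnrich(
  write: () => Promise<void>,
  enrich: () => Promise<Diagnostic[]>,
  budgetMs: number,
): Promise<{ output: string; diagnostics: Diagnostic[] }> {
  await write()
  const diagnostics = await withFallback<Diagnostic[]>(enrich(), budgetMs, [])
  return { output: "Wrote file successfully", diagnostics }
}

// Second suggestion: remember servers that timed out so later callers skip them.
const broken = new Set<string>()

async function getClient(
  serverID: string,
  spawn: () => Promise<string>,
  budgetMs: number,
): Promise<string | undefined> {
  if (broken.has(serverID)) return undefined
  const client = await withFallback<string | undefined>(spawn(), budgetMs, undefined)
  if (client === undefined) broken.add(serverID)
  return client
}

// A promise that never settles, standing in for a wedged spawn or initialize.
function hang<T>(): Promise<T> {
  return new Promise<T>(() => {})
}

const demo = (async () => {
  const result = await writeThenEnrich(async () => {}, () => hang<Diagnostic[]>(), 50)
  let spawns = 0
  const spawn = () => {
    spawns++
    return hang<string>()
  }
  await getClient("pyright", spawn, 50) // waits out the budget, then marks the server broken
  await getClient("pyright", spawn, 50) // short-circuits without spawning again
  return { result, spawns, brokenHasPyright: broken.has("pyright") }
})()
```

The call-site bound protects the write tool specifically, while the broken set means only the first caller pays the spawn budget.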

The original reporter's symptom (file never on disk) is a separate bug in the external-directory permission ask — worth a distinct issue and fix, since the LSP timeout above won't help if execution stalls before the file write.

…22872)

Adds a failing regression test that reproduces the write tool hang
reported in #22872. The write tool calls lsp.touchFile + lsp.diagnostics
to enrich its output; if a matching LSP server spawns but never responds
to the initialize request, the tool blocks on LSPClient.create's 45s
withTimeout.

The test configures a fake LSP server (hanging-lsp-server.js) that
swallows every message and never replies, asserts the file is still
written correctly, and checks the tool returns within 10s. On dev today
the assertion fails with ~45s actual, proving the hang. The fix should
make this green by bounding the diagnostic-enrichment tail.

@kitlangton kitlangton changed the title from "test: reproducer for write tool hanging on slow LSP initialize (#22872)" to "test: reproducer for write tool hanging on slow LSP initialize (related to #22872)" on Apr 16, 2026

Adds a second reproducer covering the 'forever' branch of issue #22872:
when Pyright.spawn calls Npm.which('pyright') and the npm registry is
unreachable (sandboxed container), arborist.reify blocks indefinitely
with no timeout.

Changes:

- Adds optional Info.spawnEffect alongside the existing async Info.spawn.
  spawnEffect returns an Effect that can yield from Npm.Service, making
  npm lookups injectable for tests.
- Migrates Pyright to use spawnEffect, pulling the venv probing logic
  into a reusable pyrightVenvInitialization helper. The legacy async
  spawn stays for backwards compatibility.
- Threads Npm.Service through LSP.layer so getClients captures a stable
  reference and uses it for any server that provides spawnEffect.
- Adds test/tool/write-lsp-spawn-hang.test.ts — mocks Npm.Service.which
  with Effect.never and asserts the write tool still returns in < 10s.
  Fails today (hangs forever); the fix must bound the touchFile tail
  so the tool cannot wait on a wedged LSP spawn.

The two reproducers now cover both hang branches:
- write-lsp-hang.test.ts: 45s LSPClient.create initialize timeout
- write-lsp-spawn-hang.test.ts: unbounded Npm.which arborist.reify

Issue #22872 reports the write tool hanging indefinitely after a file
is written. Two underlying causes, both in the post-write LSP
enrichment path:

1. LSPClient.create wraps the `initialize` request in a 45s
   withTimeout. If the spawned LSP process is wedged (happens with
   pyright under certain conditions), every write that matches that
   LSP blocks the tool for up to 45s even though the file is on disk.

2. Server.spawn for npm-distributed LSPs (pyright, tsserver,
   biome, ...) calls Npm.which, which internally uses arborist.reify
   with no timeout. In sandboxed containers with no network access
   this promise never resolves — the write tool hangs forever.

Fix applied at three layers of defense:

- write.ts / edit.ts / apply_patch.ts: wrap the touchFile +
  diagnostics tail in a 5s Effect.timeout with catch-to-empty.
  Diagnostics are a best-effort enrichment; they must not block the
  tool's return after the file is already written.
- lsp.ts schedule(): bound server.spawn with a 10s Promise.race
  timeout. On timeout the server is added to s.broken so subsequent
  touches short-circuit instantly instead of re-racing.
- client.ts: lower the `initialize` withTimeout from 45_000 to
  10_000. If a server hasn't responded to initialize in 10s it's
  wedged; 45s was punishing for no benefit.

Reproducer tests (added in earlier commits on this branch) now pass:
- write-lsp-hang.test.ts  (branch A, 45s initialize timeout)
- write-lsp-spawn-hang.test.ts  (branch B, forever Npm.which)
Both complete in ~5s.

Full opencode test suite: 1934 pass, 0 fail.