Add conversation lock/unlock to ConversationTranscriptStoreInterface #74

@lezama

Problem

ConversationTranscriptStoreInterface (post #67) exposes create_session, get_session, update_session, delete_session, get_recent_pending_session, and update_title. There is no single-writer guarantee on a transcript row.

This is a foot-gun for any host that runs:

  • long-poll chat where a user message and a runner update can overlap,
  • async tool execution that resumes a session after an HTTP timeout,
  • background agent jobs that share a session with an interactive runner,
  • A2A invocations where two chained runners may both try to append to the same parent transcript.

In all four cases two update_session() calls can race. Because update_session takes the complete messages array (not a delta), the second write overwrites the first wholesale and a turn vanishes silently.
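The lost update can be reproduced in a few lines. This is a minimal sketch, not the real store: DemoTranscriptStore is a hypothetical in-memory stand-in that models only the whole-array write described above, and the two "writers" are simulated sequentially in one process.

```php
<?php
// Hypothetical in-memory stand-in for a transcript store; only the
// whole-array write semantics of update_session() are modelled.
final class DemoTranscriptStore {
    /** @var array<string, array<int, string>> */
    private array $rows = [];

    public function get_session( string $session_id ): array {
        return $this->rows[ $session_id ] ?? [];
    }

    public function update_session( string $session_id, array $messages ): void {
        // Replaces the complete messages array rather than appending a delta.
        $this->rows[ $session_id ] = $messages;
    }
}

$store = new DemoTranscriptStore();
$store->update_session( 's1', [ 'user: hello' ] );

// Two writers read the same snapshot before either one writes.
$runner_view = $store->get_session( 's1' );
$user_view   = $store->get_session( 's1' );

$runner_view[] = 'assistant: hi there';
$user_view[]   = 'user: one more thing';

$store->update_session( 's1', $runner_view );
$store->update_session( 's1', $user_view ); // Overwrites the runner's write.

// Only the second writer's view survives; the assistant turn is gone.
$final = $store->get_session( 's1' );
```

Neither call fails or warns, which is why the missing turn goes unnoticed.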

The failure mode is generic across hosts. Locking is not yet implemented in Extra-Chill/data-machine (verified: no lock / acquire / GET_LOCK references in inc/), but its absence there is a known gap, not evidence that the substrate should skip it.

Proposed shape

Two new methods on the transcript store contract:

/**
 * Acquire a single-writer lock on this session.
 *
 * @param string $session_id Session UUID.
 * @param int    $ttl_seconds Lock TTL. After expiry the lock is reclaimable.
 * @return string|null Lock token to pass back to release_session_lock(),
 *                     or null when contention prevents acquisition.
 */
public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string;

/**
 * Release a previously acquired lock.
 *
 * Implementations MUST verify the supplied token matches the active lock
 * before releasing — a stale token (after TTL expiry and reacquisition)
 * MUST NOT release another runner's lock.
 *
 * @param string $session_id Session UUID.
 * @param string $lock_token Token returned by acquire_session_lock().
 * @return bool True on successful release. False on token mismatch or no active lock.
 */
public function release_session_lock( string $session_id, string $lock_token ): bool;
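To make the contract concrete, here is one possible in-memory implementation of the two methods. It is a sketch under stated assumptions: the class name InMemorySessionLocks and the injectable clock are hypothetical, a plain PHP array is only safe within a single process, and treating an expired lock as "no active lock" on release is one reading of the docblock rather than a settled decision.

```php
<?php
// Hypothetical in-memory implementation, for illustration only. A real
// store would need an atomic primitive in its backend.
final class InMemorySessionLocks {
    /** @var array<string, array{token: string, expires_at: int}> */
    private array $locks = [];

    /** @var callable Clock returning Unix seconds; injectable for tests. */
    private $clock;

    public function __construct( ?callable $clock = null ) {
        $this->clock = $clock ?? 'time';
    }

    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string {
        $now  = ( $this->clock )();
        $held = $this->locks[ $session_id ] ?? null;

        // An unexpired lock means contention; an expired one is reclaimable.
        if ( null !== $held && $held['expires_at'] > $now ) {
            return null;
        }

        $token = bin2hex( random_bytes( 16 ) );
        $this->locks[ $session_id ] = [
            'token'      => $token,
            'expires_at' => $now + $ttl_seconds,
        ];
        return $token;
    }

    public function release_session_lock( string $session_id, string $lock_token ): bool {
        $held = $this->locks[ $session_id ] ?? null;
        if ( null === $held ) {
            return false;
        }
        // Assumption: an expired lock counts as "no active lock".
        if ( $held['expires_at'] <= ( $this->clock )() ) {
            unset( $this->locks[ $session_id ] );
            return false;
        }
        // A token from an earlier holder must not release the current lock.
        if ( ! hash_equals( $held['token'], $lock_token ) ) {
            return false;
        }
        unset( $this->locks[ $session_id ] );
        return true;
    }
}

// Usage with a controllable clock.
$now   = 1000;
$locks = new InMemorySessionLocks( function () use ( &$now ): int {
    return $now;
} );

$first  = $locks->acquire_session_lock( 's1', 300 ); // Token string.
$second = $locks->acquire_session_lock( 's1', 300 ); // Null: contention.

$now  += 301;                                        // TTL has elapsed.
$third = $locks->acquire_session_lock( 's1', 300 );  // Reclaimed by a new holder.

$stale_release = $locks->release_session_lock( 's1', $first ); // Stale token.
$valid_release = $locks->release_session_lock( 's1', $third ); // Current token.
```

The usage section walks the four cases named in the acceptance criteria: happy path, contention, TTL expiry with reacquisition, and stale-token rejection.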

TTL design

Locks are TTL-bounded (rather than infinite) so a crashed runner cannot freeze a session permanently. The 300-second default is a starting point that comfortably covers the longest reasonable single-turn provider call. Callers that need longer protection can re-acquire mid-turn.

Token semantics

Returning a lock token (vs. boolean) means release_session_lock can refuse to release a lock that has been reclaimed by another runner after TTL expiry. Without the token, the reclaiming runner can have its lock silently released by a stale callback from the original holder.

Composing with update_session

Locks remain advisory. update_session does not enforce that the caller holds the lock — that responsibility stays with AgentConversationLoop (or any other orchestrator), which acquires before its turn-runner dispatch and releases after transcript_persister writes. Forcing every store to enforce lock ownership at the data-access layer would couple persistence to orchestration policy.
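The acquire-before-dispatch, release-after-persist sequence might look roughly like the following. Everything here is hypothetical scaffolding: DemoSessionLocks, DemoArrayLocks (which ignores the TTL) and run_locked_turn() exist only to make the sketch runnable, and AgentConversationLoop's real internals are not shown.

```php
<?php
// Hypothetical minimal lock contract mirroring the two proposed methods.
interface DemoSessionLocks {
    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string;
    public function release_session_lock( string $session_id, string $lock_token ): bool;
}

// Trivial single-process implementation; the TTL is ignored for brevity.
final class DemoArrayLocks implements DemoSessionLocks {
    /** @var array<string, string> */
    private array $tokens = [];

    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string {
        if ( isset( $this->tokens[ $session_id ] ) ) {
            return null;
        }
        return $this->tokens[ $session_id ] = bin2hex( random_bytes( 8 ) );
    }

    public function release_session_lock( string $session_id, string $lock_token ): bool {
        if ( ( $this->tokens[ $session_id ] ?? null ) !== $lock_token ) {
            return false;
        }
        unset( $this->tokens[ $session_id ] );
        return true;
    }
}

/**
 * Run one turn under the advisory lock. Returns false without running
 * the turn when the lock is contended.
 */
function run_locked_turn( DemoSessionLocks $locks, string $session_id, callable $turn ): bool {
    $token = $locks->acquire_session_lock( $session_id );
    if ( null === $token ) {
        return false;
    }
    try {
        $turn(); // Turn-runner dispatch plus transcript persistence would go here.
    } finally {
        // Release even when the turn throws, so the session is not held until TTL expiry.
        $locks->release_session_lock( $session_id, $token );
    }
    return true;
}

$locks = new DemoArrayLocks();
$log   = [];

$outer_ran = run_locked_turn( $locks, 's1', function () use ( $locks, &$log ): void {
    $log[] = 'outer turn';
    // A competing writer on the same session is turned away while the lock is held.
    $inner_ran = run_locked_turn( $locks, 's1', function () use ( &$log ): void {
        $log[] = 'inner turn';
    } );
    $log[] = $inner_ran ? 'inner ran' : 'inner skipped';
} );

$after_ran = run_locked_turn( $locks, 's1', function () use ( &$log ): void {
    $log[] = 'later turn';
} );
```

Note that nothing stops a caller from writing to the store without going through run_locked_turn(); that is what "advisory" means here.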

Open question — interface placement

Two reasonable shapes:

A) Add to ConversationTranscriptStoreInterface directly.
Pros: every transcript store has the lock primitive available, no separate adapter wiring.
Cons: stores that genuinely cannot lock (e.g. a write-only audit log or a third-party API-backed store) have to stub them with a no-op, which is a quiet correctness regression.

B) Sibling interface ConversationTranscriptLockInterface.
Pros: stores opt in by implementing both. Composition makes "this store does not lock" a type-checkable property.
Cons: orchestrators have to feature-detect, and the option for hosts to silently use a non-locking store is itself a foot-gun.
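For reference, feature detection under option (B) could be as small as an instanceof check. The interfaces and classes below are trimmed-down, hypothetical stand-ins (prefixed Demo) and not the real contracts; the point is only to show where the non-locking branch becomes visible to the orchestrator.

```php
<?php
// Hypothetical, trimmed-down versions of the two interfaces under option (B).
interface DemoTranscriptStoreInterface {
    public function update_session( string $session_id, array $messages ): bool;
}

interface DemoTranscriptLockInterface {
    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string;
    public function release_session_lock( string $session_id, string $lock_token ): bool;
}

// A store that cannot lock implements only the store interface.
final class DemoAuditLogStore implements DemoTranscriptStoreInterface {
    public function update_session( string $session_id, array $messages ): bool {
        return true;
    }
}

// A store that can lock opts in by implementing both.
final class DemoLockingStore implements DemoTranscriptStoreInterface, DemoTranscriptLockInterface {
    /** @var array<string, string> */
    private array $tokens = [];

    public function update_session( string $session_id, array $messages ): bool {
        return true;
    }

    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string {
        if ( isset( $this->tokens[ $session_id ] ) ) {
            return null;
        }
        return $this->tokens[ $session_id ] = bin2hex( random_bytes( 8 ) );
    }

    public function release_session_lock( string $session_id, string $lock_token ): bool {
        if ( ( $this->tokens[ $session_id ] ?? null ) !== $lock_token ) {
            return false;
        }
        unset( $this->tokens[ $session_id ] );
        return true;
    }
}

/**
 * Report how an orchestrator would treat the store. Surfacing the
 * non-locking branch explicitly (for example by logging it) is one way
 * to keep it from being silent.
 */
function describe_write_mode( DemoTranscriptStoreInterface $store ): string {
    return $store instanceof DemoTranscriptLockInterface ? 'locked' : 'unlocked';
}

$audit_mode   = describe_write_mode( new DemoAuditLogStore() );
$locking_mode = describe_write_mode( new DemoLockingStore() );
```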

Leaning toward (B), but want input.

Acceptance criteria

  • The substrate exposes the two methods with a documented contract about TTL semantics, token verification, and orchestration-vs-store responsibility for honoring the lock.
  • A reference no-op implementation exists for adopters that don't need locking.
  • AgentConversationLoop consults the lock primitive (when available) around the turn-runner + transcript persistence sequence.
  • Smoke tests cover: acquire-then-release happy path, contention (second acquire returns null), TTL expiry + reacquisition, stale-token release rejection.

AI assistance

  • AI assistance: Yes
  • Tool(s): Claude Code (Opus 4.7)
  • Used for: Auditing ConversationTranscriptStoreInterface and AgentConversationLoop to identify the missing single-writer primitive and drafting the contract proposal.
