Problem
ConversationTranscriptStoreInterface (post #67) exposes create_session, get_session, update_session, delete_session, get_recent_pending_session, and update_title. There is no single-writer guarantee on a transcript row.
This is a foot-gun for any host that runs:
- long-poll chat where a user message and a runner update can overlap,
- async tool execution that resumes a session after an HTTP timeout,
- background agent jobs that share a session with an interactive runner,
- A2A invocations where two chained runners may both try to append to the same parent transcript.
In all four cases two update_session() calls can race. The second overwrites the first wholesale, because update_session takes the complete messages array (not a delta), and a turn vanishes silently.
The failure mode is generic across hosts. A single-writer lock is not yet implemented in Extra-Chill/data-machine (verified: no lock / acquire / GET_LOCK references in inc/), but its absence there is a known gap, not evidence that the substrate should skip it.
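The lost update can be reproduced without any real concurrency, just by interleaving two read-modify-write sequences. The sketch below assumes a hypothetical in-memory store (LostUpdateDemoStore) whose update_session() replaces the whole messages array. The real interface's signatures and message shapes may differ.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory stand-in for a transcript store. Only the
// full-array replacement behavior of update_session() is modeled.
final class LostUpdateDemoStore {
    /** @var array<string, array<int, string>> */
    private array $rows = [];

    public function create_session( string $session_id ): void {
        $this->rows[ $session_id ] = [];
    }

    /** @return array<int, string> */
    public function get_session( string $session_id ): array {
        return $this->rows[ $session_id ];
    }

    /** @param array<int, string> $messages Complete messages array, not a delta. */
    public function update_session( string $session_id, array $messages ): void {
        $this->rows[ $session_id ] = $messages;
    }
}

$store = new LostUpdateDemoStore();
$store->create_session( 's1' );
$store->update_session( 's1', [ 'user: hello' ] );

// Both writers read the same snapshot before either one writes.
$snapshot_a = $store->get_session( 's1' );
$snapshot_b = $store->get_session( 's1' );

// Writer A appends a user message and persists the full array.
$store->update_session( 's1', array_merge( $snapshot_a, [ 'user: follow-up' ] ) );

// Writer B appends a runner update to its stale snapshot and persists.
$store->update_session( 's1', array_merge( $snapshot_b, [ 'assistant: tool result' ] ) );

// Writer A's turn is no longer in the transcript.
$final = $store->get_session( 's1' );
```

Neither write fails or reports a conflict, which is why the dropped turn goes unnoticed.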
Proposed shape
Two new methods on the transcript store contract:
/**
 * Acquire a single-writer lock on this session.
 *
 * @param string $session_id  Session UUID.
 * @param int    $ttl_seconds Lock TTL. After expiry the lock is reclaimable.
 * @return string|null Lock token to pass back to release_session_lock(),
 *                     or null when contention prevents acquisition.
 */
public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string;

/**
 * Release a previously acquired lock.
 *
 * Implementations MUST verify the supplied token matches the active lock
 * before releasing — a stale token (after TTL expiry and reacquisition)
 * MUST NOT release another runner's lock.
 *
 * @param string $session_id Session UUID.
 * @param string $lock_token Token returned by acquire_session_lock().
 * @return bool True on successful release. False on token mismatch or no active lock.
 */
public function release_session_lock( string $session_id, string $lock_token ): bool;
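As a rough illustration of the contract, here is a minimal in-memory sketch of the two methods. The class name InMemorySessionLockSketch and the injectable clock are assumptions added for illustration, so TTL expiry can be exercised without sleeping. Treating an expired lock as "no active lock" on release is one reading of the contract, not something the proposal settles. A real store would need an atomic compare-and-set in its backing storage.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory implementation of the two proposed methods.
final class InMemorySessionLockSketch {
    /** @var array<string, array{token: string, expires_at: int}> */
    private array $locks = [];

    /** @var callable Returns the current Unix timestamp (assumption: injectable for tests). */
    private $clock;

    public function __construct( ?callable $clock = null ) {
        $this->clock = $clock ?? 'time';
    }

    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string {
        $now    = ( $this->clock )();
        $active = $this->locks[ $session_id ] ?? null;

        // An unexpired lock means contention.
        if ( null !== $active && $active['expires_at'] > $now ) {
            return null;
        }

        $token = bin2hex( random_bytes( 16 ) );
        $this->locks[ $session_id ] = [
            'token'      => $token,
            'expires_at' => $now + $ttl_seconds,
        ];
        return $token;
    }

    public function release_session_lock( string $session_id, string $lock_token ): bool {
        $now    = ( $this->clock )();
        $active = $this->locks[ $session_id ] ?? null;

        // Assumption: an expired lock counts as "no active lock".
        if ( null === $active || $active['expires_at'] <= $now ) {
            unset( $this->locks[ $session_id ] );
            return false;
        }
        // Token mismatch: the lock belongs to another holder.
        if ( ! hash_equals( $active['token'], $lock_token ) ) {
            return false;
        }
        unset( $this->locks[ $session_id ] );
        return true;
    }
}

$now   = 1000;
$locks = new InMemorySessionLockSketch(
    function () use ( &$now ): int {
        return $now;
    }
);

$first  = $locks->acquire_session_lock( 's1', 300 ); // token
$second = $locks->acquire_session_lock( 's1', 300 ); // null: contention

$now  += 301; // TTL elapsed, so the lock is reclaimable.
$third = $locks->acquire_session_lock( 's1', 300 ); // new token

$stale_release = $locks->release_session_lock( 's1', $first ); // rejected
$valid_release = $locks->release_session_lock( 's1', $third ); // accepted
```

The usage lines walk through the four cases listed in the acceptance criteria: happy path, contention, TTL expiry with reacquisition, and stale-token rejection.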
TTL design
Locks are TTL-bounded rather than infinite so that a crashed runner cannot freeze a session permanently. The 300-second default is a starting point that comfortably covers the longest reasonable single-turn provider call. Callers that need longer protection can re-acquire mid-turn.
Token semantics
Returning a lock token (vs. boolean) means release_session_lock can refuse to release a lock that has been reclaimed by another runner after TTL expiry. Without the token, the reclaiming runner can have its lock silently released by a stale callback from the original holder.
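To make the failure mode concrete, the sketch below models the rejected alternative: a hypothetical boolean-only lock (BooleanOnlyLockSketch, not part of the proposal) with timestamps passed in explicitly. Because release carries no token, the store cannot tell the original holder from the runner that reclaimed the lock.

```php
<?php
declare(strict_types=1);

// Hypothetical boolean-only lock, shown only to illustrate the problem.
final class BooleanOnlyLockSketch {
    /** @var array<string, int> Map of session_id to expiry timestamp. */
    private array $expires_at = [];

    public function acquire( string $session_id, int $now, int $ttl_seconds ): bool {
        if ( ( $this->expires_at[ $session_id ] ?? 0 ) > $now ) {
            return false;
        }
        $this->expires_at[ $session_id ] = $now + $ttl_seconds;
        return true;
    }

    // No token, so any caller can release whatever lock is current.
    public function release( string $session_id ): bool {
        $held = isset( $this->expires_at[ $session_id ] );
        unset( $this->expires_at[ $session_id ] );
        return $held;
    }
}

$lock = new BooleanOnlyLockSketch();

// Runner A acquires at t=1000 with a 300 second TTL.
$runner_a_holds = $lock->acquire( 's1', 1000, 300 );

// Runner A stalls past its TTL, and runner B reclaims at t=1400.
$runner_b_holds = $lock->acquire( 's1', 1400, 300 );

// Runner A's late callback releases "its" lock, which is now B's.
$stale_release = $lock->release( 's1' );

// Runner C acquires at t=1401 while B is still mid-turn.
$runner_c_holds = $lock->acquire( 's1', 1401, 300 );
```

All four calls succeed, so B and C end up writing concurrently. With a token, the third call would be rejected and B's lock would stay in place.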
Composing with update_session
Locks remain advisory. update_session does not enforce that the caller holds the lock — that responsibility stays with AgentConversationLoop (or any other orchestrator), which acquires before its turn-runner dispatch and releases after transcript_persister writes. Forcing every store to enforce lock ownership at the data-access layer would couple persistence to orchestration policy.
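One way an orchestrator might honor the advisory lock is to wrap the turn in an acquire / try / finally / release sequence. Everything named below (SessionLockSketchInterface, run_turn_with_session_lock, SingleHolderLockFake) is hypothetical. How AgentConversationLoop would actually wire this, and what it does on contention, is left open by the proposal.

```php
<?php
declare(strict_types=1);

// Hypothetical minimal contract mirroring the two proposed methods.
interface SessionLockSketchInterface {
    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string;
    public function release_session_lock( string $session_id, string $lock_token ): bool;
}

// Hypothetical helper: runs $turn only while holding the lock and always
// releases afterwards. Returns null on contention (an assumption; a real
// orchestrator might retry or queue instead).
function run_turn_with_session_lock( SessionLockSketchInterface $locks, string $session_id, callable $turn ) {
    $token = $locks->acquire_session_lock( $session_id );
    if ( null === $token ) {
        return null;
    }
    try {
        // Turn-runner dispatch and transcript persistence both happen here.
        return $turn();
    } finally {
        $locks->release_session_lock( $session_id, $token );
    }
}

// Tiny fake without TTL handling, just enough to exercise the helper.
final class SingleHolderLockFake implements SessionLockSketchInterface {
    /** @var array<string, string> */
    private array $tokens = [];

    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string {
        if ( isset( $this->tokens[ $session_id ] ) ) {
            return null;
        }
        $this->tokens[ $session_id ] = bin2hex( random_bytes( 8 ) );
        return $this->tokens[ $session_id ];
    }

    public function release_session_lock( string $session_id, string $lock_token ): bool {
        if ( ( $this->tokens[ $session_id ] ?? null ) !== $lock_token ) {
            return false;
        }
        unset( $this->tokens[ $session_id ] );
        return true;
    }
}

$locks = new SingleHolderLockFake();

$outer = run_turn_with_session_lock(
    $locks,
    's1',
    function () use ( $locks ) {
        // A second writer on the same session sees contention and is skipped.
        $inner = run_turn_with_session_lock( $locks, 's1', fn() => 'inner ran' );
        return [ 'inner' => $inner, 'outer' => 'outer ran' ];
    }
);

// The finally block released the lock, so a later acquire succeeds.
$after = $locks->acquire_session_lock( 's1' );
```

Note that the store's update_session() is never involved in the check, which keeps lock policy in the orchestrator as described above.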
Open question — interface placement
Two reasonable shapes:
A) Add to ConversationTranscriptStoreInterface directly.
Pros: every transcript store has the lock primitive available, no separate adapter wiring.
Cons: stores that genuinely cannot lock (e.g. a write-only audit log or a third-party API-backed store) have to stub them with a no-op, which is a quiet correctness regression.
B) Sibling interface ConversationTranscriptLockInterface.
Pros: stores opt in by implementing both. Composition makes "this store does not lock" a type-checkable property.
Cons: orchestrators have to feature-detect, and the option for hosts to silently use a non-locking store is itself a foot-gun.
Leaning toward (B), but want input.
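Under option (B), the feature detection an orchestrator needs reduces to an instanceof check. The sketch below uses hypothetical, trimmed stand-ins for both interfaces (TranscriptStoreSketch, TranscriptLockSketch). The real ConversationTranscriptStoreInterface has more methods and possibly different signatures.

```php
<?php
declare(strict_types=1);

// Hypothetical, trimmed stand-ins for the two contracts in option (B).
interface TranscriptStoreSketch {
    public function update_session( string $session_id, array $messages ): bool;
}

interface TranscriptLockSketch {
    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string;
    public function release_session_lock( string $session_id, string $lock_token ): bool;
}

// A store that cannot lock simply does not implement the lock interface.
final class PlainStoreSketch implements TranscriptStoreSketch {
    public function update_session( string $session_id, array $messages ): bool {
        return true;
    }
}

// A store that opts in implements both interfaces.
final class LockingStoreSketch implements TranscriptStoreSketch, TranscriptLockSketch {
    public function update_session( string $session_id, array $messages ): bool {
        return true;
    }

    public function acquire_session_lock( string $session_id, int $ttl_seconds = 300 ): ?string {
        return bin2hex( random_bytes( 8 ) ); // Placeholder: no contention tracking.
    }

    public function release_session_lock( string $session_id, string $lock_token ): bool {
        return true; // Placeholder: no token verification.
    }
}

// The orchestrator-side check: locking support is a property of the type.
function store_supports_locking( TranscriptStoreSketch $store ): bool {
    return $store instanceof TranscriptLockSketch;
}

$plain_supports   = store_supports_locking( new PlainStoreSketch() );
$locking_supports = store_supports_locking( new LockingStoreSketch() );
```

The same check is where a host could choose to log or refuse when a non-locking store is configured, which speaks to the foot-gun noted in the cons.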
Acceptance criteria
- The substrate exposes the two methods with a documented contract about TTL semantics, token verification, and orchestration-vs-store responsibility for honoring the lock.
- A reference no-op implementation exists for adopters that don't need locking.
- AgentConversationLoop consults the lock primitive (when available) around the turn-runner + transcript persistence sequence.
- Smoke tests cover: acquire-then-release happy path, contention (second acquire returns null), TTL expiry + reacquisition, stale-token release rejection.
AI assistance
- AI assistance: Yes
- Tool(s): Claude Code (Opus 4.7)
- Used for: Auditing ConversationTranscriptStoreInterface and AgentConversationLoop to identify the missing single-writer primitive and drafting the contract proposal.