Feature Request: Systematic prefix-cache stability — learn from deepseek-reasonix's 99%+ cache hit architecture #2264

@encyc

Problem

CodeWhale already has prefix-cache awareness in its DNA — the "volatile-content-last invariant" in prompts.rs, byte-stable assistant message tests in client.rs, and a cache-hit-percent footer chip. But these are best-effort conventions, not a systematic invariant enforced at the architecture level.

deepseek-reasonix takes a different approach: it treats prefix-cache stability as a hard architectural invariant, not a guideline. The reported result is 85–99%+ cache hit rates in real sessions. On DeepSeek's pricing (¥0.02/1M cached vs ¥1/1M uncached), a cached input token costs 50× less than an uncached one, so input cost falls toward that 50× ceiling as the hit rate approaches 100%.

What deepseek-reasonix does that CodeWhale could adopt

1. Byte-stable prompt construction as first-class invariant

Reasonix divides every request into three rigid zones that never shift:

  • Prefix (pinned): system prompt, tool definitions, persistent memory — hashed at construction, never mutated mid-session
  • Log (append-only): conversation history — only appended, never reordered or edited
  • Scratch (ephemeral): per-turn metadata — wiped at every turn boundary

Any code path that would mutate the prefix or reorder the log is rejected at the framework level, not caught in review.
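
To make "rejected at the framework level" concrete, here is a minimal sketch of how the three zones could be expressed so that the compiler refuses cache-busting operations. All names (`zones`, `PinnedPrefix`, `Session`, `append`, `note`, `end_turn`, `render`) are hypothetical and not taken from Reasonix or CodeWhale source. The idea is only that private fields plus a narrow API leave no way to edit the prefix or reorder the log from outside the module.

```rust
// Hypothetical sketch: these names are illustrative, not CodeWhale or Reasonix APIs.
mod zones {
    /// Pinned zone: built once; the field is private and there is no mutator.
    pub struct PinnedPrefix {
        bytes: String,
    }

    impl PinnedPrefix {
        pub fn new(system_prompt: &str, tool_definitions: &str, memory: &str) -> Self {
            Self { bytes: format!("{system_prompt}\n{tool_definitions}\n{memory}") }
        }

        pub fn as_str(&self) -> &str {
            &self.bytes
        }
    }

    /// A session owns all three zones and exposes only cache-safe operations.
    pub struct Session {
        prefix: PinnedPrefix,
        log: Vec<String>,     // append-only: no remove, insert, or edit is exposed
        scratch: Vec<String>, // ephemeral: cleared by `end_turn`
    }

    impl Session {
        pub fn new(prefix: PinnedPrefix) -> Self {
            Self { prefix, log: Vec::new(), scratch: Vec::new() }
        }

        /// The only way to change the log is to add to its end.
        pub fn append(&mut self, message: &str) {
            self.log.push(message.to_string());
        }

        /// Per-turn metadata goes to scratch, never into the prefix or log.
        pub fn note(&mut self, metadata: &str) {
            self.scratch.push(metadata.to_string());
        }

        /// Turn boundary: wipe the scratch zone.
        pub fn end_turn(&mut self) {
            self.scratch.clear();
        }

        /// Render prefix, then log, then scratch, so volatile bytes come last.
        pub fn render(&self) -> String {
            let mut out = String::from(self.prefix.as_str());
            for line in self.log.iter().chain(self.scratch.iter()) {
                out.push('\n');
                out.push_str(line);
            }
            out
        }
    }
}
```

Code outside `zones` can call `append`, `note`, and `end_turn`, but it cannot reach `prefix` or index into `log`, so each request starts with the same pinned bytes followed by a log that only grows.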

This contrasts with CodeWhale's current approach where the volatile-content boundary is documented in comments but not enforced — a stray edit_file to instructions or an unlucky /compact can silently bust the cache.

2. Cross-session cache persistence

Reasonix sessions can be left running, but the prefix also stays stable across sessions because the pinned zone is built deterministically. Reopening a session reconstructs the identical prefix from configuration, so the first API call of a new session can still hit the cache if the previous session was active recently.

3. Cache-first cost visibility

Reasonix surfaces cache economics directly: every turn shows cache hit rate %, estimated cost with/without caching, and cumulative savings. CodeWhale's footer chip (red <40%, yellow <80%) is a good start but doesn't show the cost impact.
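
As a rough sketch of what the cost translation could look like, the function below turns the two counters CodeWhale already tracks into a hit rate, a cost with caching, a cost without caching, and the difference. The struct `CacheCost` and function `cache_cost` are hypothetical names, and the prices are simply the per-1M-token figures quoted in this issue, so they should be treated as assumptions rather than current DeepSeek pricing.

```rust
// Prices are the per-1M-token figures quoted in this issue; treat them as assumptions.
const CACHED_YUAN_PER_M: f64 = 0.02;
const UNCACHED_YUAN_PER_M: f64 = 1.0;

/// Hypothetical per-turn summary derived from the two counters already tracked.
#[derive(Debug, PartialEq)]
pub struct CacheCost {
    pub hit_percent: f64,
    pub cost_with_cache: f64,    // yuan paid for input tokens at the observed hit rate
    pub cost_without_cache: f64, // yuan if every input token had missed the cache
    pub saved: f64,              // difference between the two
}

pub fn cache_cost(prompt_cache_hit_tokens: u64, prompt_cache_miss_tokens: u64) -> CacheCost {
    let hit = prompt_cache_hit_tokens as f64;
    let miss = prompt_cache_miss_tokens as f64;
    let total = hit + miss;

    // Avoid dividing by zero on a turn that reported no input tokens.
    let hit_percent = if total == 0.0 { 0.0 } else { 100.0 * hit / total };

    let cost_with_cache = (hit * CACHED_YUAN_PER_M + miss * UNCACHED_YUAN_PER_M) / 1_000_000.0;
    let cost_without_cache = total * UNCACHED_YUAN_PER_M / 1_000_000.0;

    CacheCost {
        hit_percent,
        cost_with_cache,
        cost_without_cache,
        saved: cost_without_cache - cost_with_cache,
    }
}
```

For example, 990,000 hit tokens and 10,000 miss tokens (a 99% hit rate) come to ¥0.0298 against ¥1.00 uncached, roughly a 34× reduction; summing `saved` over a session would give the cumulative figure.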

Existing CodeWhale foundation (good news — not starting from zero)

| Component | What exists |
| --- | --- |
| prompts.rs:614 | Volatile-content-last invariant (documented) |
| client.rs:1529 | Byte-stable assistant message test |
| client.rs / ui.rs | prompt_cache_hit_tokens / prompt_cache_miss_tokens tracked per turn |
| Footer | Cache hit % chip with color thresholds |
| System prompt | Layered most-static-first for DeepSeek KV cache |

What's missing

  1. No architectural enforcement — the volatile-content boundary is a comment, not a compile-time or runtime gate
  2. No prefix hashing — we can't detect when the prefix has been mutated and warn
  3. No cross-session prefix reuse — restarting CodeWhale invalidates the entire cache
  4. No cost-equivalent visibility — cache hit % is shown but without translating to actual ¥ saved
  5. /compact busts cache — the compaction relay intentionally rewrites the prefix, and there's no strategy to mitigate the cost

Suggested approach

Phase A — Harden existing invariants (low risk)

  • Add a compile-time check or runtime assertion that system prompt construction is deterministic
  • Warn (footer yellow) when prompt_cache_hit_tokens drops significantly between turns
  • Add a /cache stats command showing cumulative cache savings in ¥
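
A minimal sketch of the first two Phase A items, under stated assumptions: `is_deterministic`, `hit_ratio`, and `cache_drop_warning` are hypothetical helpers, the warning compares hit ratios rather than raw `prompt_cache_hit_tokens` (one possible reading of "drops significantly"), and the threshold is a placeholder to be tuned. Building the prompt twice only catches nondeterminism that shows up between two calls in the same process, so it is a cheap smoke test, not a proof.

```rust
/// Build the prompt twice and compare bytes. Catches clocks, counters, and
/// random ids in the builder; it cannot catch differences between processes.
pub fn is_deterministic<F: Fn() -> String>(build: F) -> bool {
    let first = build();
    let second = build();
    first.as_bytes() == second.as_bytes()
}

/// Fraction of input tokens served from cache, or None if the turn had none.
pub fn hit_ratio(hit_tokens: u64, miss_tokens: u64) -> Option<f64> {
    let total = hit_tokens + miss_tokens;
    if total == 0 {
        None
    } else {
        Some(hit_tokens as f64 / total as f64)
    }
}

/// True when the hit ratio fell by more than `max_drop` between two turns.
/// Each argument is a (hit_tokens, miss_tokens) pair for one turn.
pub fn cache_drop_warning(previous: (u64, u64), current: (u64, u64), max_drop: f64) -> bool {
    match (hit_ratio(previous.0, previous.1), hit_ratio(current.0, current.1)) {
        (Some(before), Some(after)) => before - after > max_drop,
        // Without data for both turns there is nothing to compare.
        _ => false,
    }
}
```

The determinism check could sit behind `debug_assert!` at session start, and `cache_drop_warning` could drive the yellow footer state.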

Phase B — Prefix zone enforcement (medium)

  • Formally split prompt construction into PinnedPrefix / AppendLog / TurnScratch zones
  • Hash the pinned prefix at session start — warn if any subsequent turn sends a different prefix
  • Reject code paths that mutate the log instead of appending
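
The hashing step could be sketched as follows. `PrefixGuard`, `PrefixCheck`, and `fnv1a64` are hypothetical names, and the example assumes the outgoing request can be viewed as one byte string whose leading bytes are the pinned prefix. FNV-1a is used here only because it fits in a few lines of std-only Rust and gives the same digest on every run; a real implementation would more likely use SHA-256 from a crate.

```rust
/// 64-bit FNV-1a, written out by hand so the digest does not depend on the
/// Rust version or on per-process hasher seeds. Not cryptographic.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    hash
}

#[derive(Debug, PartialEq)]
pub enum PrefixCheck {
    Stable,
    Mutated { expected: u64, actual: u64 },
}

/// Remembers the digest and length of the prefix pinned at session start.
pub struct PrefixGuard {
    pinned_hash: u64,
    pinned_len: usize,
}

impl PrefixGuard {
    pub fn pin(prefix: &str) -> Self {
        Self { pinned_hash: fnv1a64(prefix.as_bytes()), pinned_len: prefix.len() }
    }

    /// Hash the leading bytes of an outgoing request and compare to the pin.
    pub fn check(&self, outgoing_request: &str) -> PrefixCheck {
        let bytes = outgoing_request.as_bytes();
        // Slice bytes, not chars, so a short or edited request cannot panic.
        let head = &bytes[..self.pinned_len.min(bytes.len())];
        let actual = fnv1a64(head);
        if bytes.len() >= self.pinned_len && actual == self.pinned_hash {
            PrefixCheck::Stable
        } else {
            PrefixCheck::Mutated { expected: self.pinned_hash, actual }
        }
    }
}
```

A `Mutated` result on any turn is the signal to warn in the footer, since that request is expected to miss the cache from the first changed byte onward.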

Phase C — Cross-session cache persistence (ambitious)

  • When a session restarts within DeepSeek's cache TTL (~5–15 min), reconstruct the identical pinned prefix
  • This requires deterministic system prompt generation (no timestamp-dependent blocks in the pinned zone, etc.)
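
One way to get a prefix that is identical across restarts is to make it a pure function of configuration and keep anything time-dependent out of it. The sketch below assumes a hypothetical `PrefixConfig` with a system prompt, tool definitions, and memory; `build_pinned_prefix` and `build_turn_scratch` are illustrative names, and the section markers are made up. Tools are stored in a `BTreeMap` so they always render in key order regardless of how they were registered.

```rust
use std::collections::BTreeMap;

/// Hypothetical configuration that fully determines the pinned prefix.
pub struct PrefixConfig {
    pub system_prompt: String,
    /// Tool name -> definition. BTreeMap iterates in key order, so
    /// registration order cannot leak into the rendered prompt.
    pub tools: BTreeMap<String, String>,
    pub memory: String,
}

/// Pure function of the config: no clock, no environment, no random ids.
pub fn build_pinned_prefix(config: &PrefixConfig) -> String {
    let mut out = String::new();
    out.push_str(&config.system_prompt);
    out.push_str("\n# Tools\n");
    for (name, definition) in &config.tools {
        out.push_str(name);
        out.push_str(": ");
        out.push_str(definition);
        out.push('\n');
    }
    out.push_str("# Memory\n");
    out.push_str(&config.memory);
    out
}

/// Time-dependent content is rendered separately and sent after the log,
/// so a date change never alters the cached prefix bytes.
pub fn build_turn_scratch(current_date: &str) -> String {
    format!("Current date: {current_date}")
}
```

Two processes that load the same configuration would then produce byte-identical prefixes, which is the precondition for the first request after a restart to hit the cache.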

Questions for discussion

  1. Would a formal PinnedPrefix / AppendLog / TurnScratch split be acceptable, or is it too invasive for the prompt construction pipeline?
  2. Cross-session cache persistence requires moving time-dependent blocks (like current date) into the scratch zone — acceptable trade-off?
  3. Should this be a standalone initiative or folded into Feature Request: Proposal a Fourth Mode "Dual" — Pro for Reasoning + Flash for Execution #1676 as a companion cost-saving pillar?
