Problem
We don't measure how often agentmemory injects the right memory but the agent fails to use it. Without this metric, "improve retrieval" and "improve reader behavior" are indistinguishable failure modes. Every retrieval-quality regression looks identical to a reader-behavior regression.
There's a category of failure where retrieval is correct (the answer-bearing memory is in the injected context block) but the agent's response doesn't reflect it — either because the context block was too long, ranking buried the relevant entry, or the agent's prompt didn't lean on the memory. Today this failure mode is invisible.
Proposed shape
New diagnostic counter: reader_failure_with_evidence — counts cases where the agent's response misses the answer despite the answer-bearing memory being in the retrieved context.
This requires:
- A ground-truth signal for "what answer was correct" → only available in the benchmark harness, not in live use
- A judgment of "did the agent's response use the memory" → judge model evaluation
In live use, a weaker proxy is possible: count cases where the agent calls smart-search, receives results, then calls smart-search again with a different query within N seconds. That's a heuristic signal for "first results didn't satisfy."
iii composition:
- New OTEL counter
agentmemory.smart_search.followup_within_window_total (live-use proxy)
- New OTEL counter
agentmemory.reader_failure_with_evidence_total (benchmark-mode only)
- iii function
mem::diagnostic::record-followup increments the proxy counter when called twice within AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS (default 30s)
- Benchmark harness emits
reader_failure_with_evidence from its scorer when it detects the gold-memory-in-context-but-wrong-answer pattern
Heuristic for live-use proxy
When mem::smart-search is called:
- Lookup
mem:recent-searches scope keyed by sessionId for the most recent search timestamp
- If found and within window AND result-set has zero overlap with current result-set (different queries returning different docs), increment
followup_within_window_total
- Write current call to
mem:recent-searches
This proxy will overcount (user genuinely refining their query is not a reader failure) but undercounts the inverse. Treat as a directional signal, not absolute.
Benchmark-mode metric
In benchmark harness, after judge scoring, walk each question:
- If
judge_correct == false AND gold-evidence-IDs ⊆ retrieved-context-IDs → increment reader_failure_with_evidence
- Else if
judge_correct == false AND gold ⊄ retrieved → that's retrieval-failure, NOT reader-failure
- Report both numbers in
scores.json
Edge cases
- Smart-search called by humans via the viewer — viewer queries shouldn't count as reader failures (no agent involved). Tag viewer-originated searches with header
X-Agentmemory-Source: viewer and skip the metric.
- Single-session re-query is normal — refining "auth flow" → "auth token expiry" is research behavior. The proxy will overcount. Document this in the metric's help-text.
- Followup window tuning — 30s default. Configurable. Document that long values overcount, short values undercount.
- Cold start — fresh
mem:recent-searches scope, no prior search. First call is never a followup. Correct.
- Storage retention —
mem:recent-searches only needs last entry per session. TTL via cron sweep, retain last 24h per session.
Acceptance
Why it matters
Disambiguates retrieval bugs from reader bugs. Today we can only measure "user complained that recall didn't work" which is too coarse. Both metrics let us see whether a change to ranking helped retrieval (followup rate down) or just shifted which questions hit the reader-failure floor. Pairs with benchmark harness — without this, benchmark wins might be hollow if they're just moving questions between failure modes.
Problem
We don't measure how often agentmemory injects the right memory but the agent fails to use it. Without this metric, "improve retrieval" and "improve reader behavior" are indistinguishable failure modes. Every retrieval-quality regression looks identical to a reader-behavior regression.
There's a category of failure where retrieval is correct (the answer-bearing memory is in the injected context block) but the agent's response doesn't reflect it — either because the context block was too long, ranking buried the relevant entry, or the agent's prompt didn't lean on the memory. Today this failure mode is invisible.
Proposed shape
New diagnostic counter:
reader_failure_with_evidence— counts cases where the agent's response misses the answer despite the answer-bearing memory being in the retrieved context.This requires:
In live use, a weaker proxy is possible: count cases where the agent calls smart-search, receives results, then calls smart-search again with a different query within N seconds. That's a heuristic signal for "first results didn't satisfy."
iii composition:
agentmemory.smart_search.followup_within_window_total(live-use proxy)agentmemory.reader_failure_with_evidence_total(benchmark-mode only)mem::diagnostic::record-followupincrements the proxy counter when called twice withinAGENTMEMORY_FOLLOWUP_WINDOW_SECONDS(default 30s)reader_failure_with_evidencefrom its scorer when it detects the gold-memory-in-context-but-wrong-answer patternHeuristic for live-use proxy
When
mem::smart-searchis called:mem:recent-searchesscope keyed bysessionIdfor the most recent search timestampfollowup_within_window_totalmem:recent-searchesThis proxy will overcount (user genuinely refining their query is not a reader failure) but undercounts the inverse. Treat as a directional signal, not absolute.
Benchmark-mode metric
In benchmark harness, after judge scoring, walk each question:
judge_correct == falseAND gold-evidence-IDs ⊆ retrieved-context-IDs → incrementreader_failure_with_evidencejudge_correct == falseAND gold ⊄ retrieved → that's retrieval-failure, NOT reader-failurescores.jsonEdge cases
X-Agentmemory-Source: viewerand skip the metric.mem:recent-searchesscope, no prior search. First call is never a followup. Correct.mem:recent-searchesonly needs last entry per session. TTL via cron sweep, retain last 24h per session.Acceptance
mem::diagnostic::record-followupfunction + wired intomem::smart-searchmem:recent-searchesKV scope + cron TTL sweepscores.jsonseparately reports the two failure modesagentmemory statussurfaces followup rate (with caveat in help text)AGENTMEMORY_FOLLOWUP_WINDOW_SECONDSconfigurableWhy it matters
Disambiguates retrieval bugs from reader bugs. Today we can only measure "user complained that recall didn't work" which is too coarse. Both metrics let us see whether a change to ranking helped retrieval (followup rate down) or just shifted which questions hit the reader-failure floor. Pairs with benchmark harness — without this, benchmark wins might be hollow if they're just moving questions between failure modes.