feat(diag): reader-failure-with-evidence counter + smart-search followup-rate proxy #771

@rohitg00

Problem

We don't measure how often agentmemory injects the right memory but the agent fails to use it. Without this metric, "improve retrieval" and "improve reader behavior" are indistinguishable failure modes. Every retrieval-quality regression looks identical to a reader-behavior regression.

There's a category of failure where retrieval is correct (the answer-bearing memory is in the injected context block) but the agent's response doesn't reflect it, because the context block was too long, ranking buried the relevant entry, or the agent's prompt didn't lean on the memory. Today this failure mode is invisible.

Proposed shape

New diagnostic counter: reader_failure_with_evidence — counts cases where the agent's response misses the answer despite the answer-bearing memory being in the retrieved context.

This requires:

  • A ground-truth signal for "what answer was correct" → only available in the benchmark harness, not in live use
  • A judgment of "did the agent's response use the memory" → judge model evaluation

In live use, a weaker proxy is possible: count cases where the agent calls smart-search, receives results, then calls smart-search again with a different query within N seconds. That's a heuristic signal for "first results didn't satisfy."

iii composition:

  • New OTEL counter agentmemory.smart_search.followup_within_window_total (live-use proxy)
  • New OTEL counter agentmemory.reader_failure_with_evidence_total (benchmark-mode only)
  • iii function mem::diagnostic::record-followup increments the proxy counter when the same session searches twice within AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS (default 30s) and the two result-sets do not overlap
  • Benchmark harness emits reader_failure_with_evidence from its scorer when it detects the gold-memory-in-context-but-wrong-answer pattern
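
As a rough illustration of the configurable window, the sketch below reads AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS with the 30s default. The function name and the fallback behaviour for unset, non-numeric, or non-positive values are assumptions of this sketch; the issue only fixes the variable name and the default.

```python
import os

DEFAULT_FOLLOWUP_WINDOW_SECONDS = 30.0  # default named in the issue


def followup_window_seconds(env=None):
    """Return the follow-up window in seconds.

    Falls back to the default when the variable is unset, not a number,
    or not positive. That fallback policy is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    raw = env.get("AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS")
    if raw is None:
        return DEFAULT_FOLLOWUP_WINDOW_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_FOLLOWUP_WINDOW_SECONDS
    # NaN fails this comparison too, so it also falls back to the default.
    return value if value > 0 else DEFAULT_FOLLOWUP_WINDOW_SECONDS
```

Passing a plain dict instead of os.environ keeps the function easy to exercise in tests.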

Heuristic for live-use proxy

When mem::smart-search is called:

  • Lookup mem:recent-searches scope keyed by sessionId for the most recent search timestamp
  • If found and within the window AND the previous result-set has zero overlap with the current result-set (different queries returning different docs), increment followup_within_window_total
  • Write current call to mem:recent-searches

This proxy will overcount (a user genuinely refining their query is not a reader failure) and will also undercount (a reader failure where the agent never searches again produces no followup). Treat it as a directional signal, not an absolute one.
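
The lookup, compare, and write steps of the heuristic can be sketched as below. This is a minimal in-memory stand-in, not the iii function itself: the class name, the dict standing in for the mem:recent-searches scope, and the explicit `now` argument are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FollowupTracker:
    window_seconds: float = 30.0
    followup_within_window_total: int = 0
    # session_id -> (timestamp, frozenset of result ids).
    # Hypothetical stand-in for the mem:recent-searches scope.
    recent: dict = field(default_factory=dict)

    def record_search(self, session_id, result_ids, now):
        """Record one search; return True if it counted as a followup."""
        current = frozenset(result_ids)
        previous = self.recent.get(session_id)
        is_followup = False
        if previous is not None:
            prev_time, prev_ids = previous
            within_window = 0 <= now - prev_time <= self.window_seconds
            # Zero overlap between previous and current result ids.
            # Note: an empty result-set is disjoint from everything here.
            if within_window and prev_ids.isdisjoint(current):
                is_followup = True
                self.followup_within_window_total += 1
        # Only the most recent search per session is kept.
        self.recent[session_id] = (now, current)
        return is_followup
```

On a cold start there is no prior entry for the session, so the first call returns False, which matches the cold-start edge case.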

Benchmark-mode metric

In benchmark harness, after judge scoring, walk each question:

  • If judge_correct == false AND gold-evidence-IDs ⊆ retrieved-context-IDs → increment reader_failure_with_evidence
  • Else if judge_correct == false AND gold-evidence-IDs ⊈ retrieved-context-IDs → that's retrieval-failure, NOT reader-failure
  • Report both numbers in scores.json
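
A minimal sketch of that walk, assuming each scored question is a dict with judge_correct, gold_evidence_ids, and retrieved_context_ids keys. Those key names and the retrieval_failure counter name are hypothetical; the real harness schema may differ.

```python
def classify_failures(questions):
    """Split judged-incorrect questions into reader vs retrieval failures."""
    counts = {"reader_failure_with_evidence": 0, "retrieval_failure": 0}
    for q in questions:
        if q["judge_correct"]:
            continue  # correct answers are neither failure mode
        gold = set(q["gold_evidence_ids"])
        retrieved = set(q["retrieved_context_ids"])
        if gold <= retrieved:
            # All gold evidence was in context, yet the answer was wrong.
            counts["reader_failure_with_evidence"] += 1
        else:
            # At least one gold memory was missing from the context.
            counts["retrieval_failure"] += 1
    return counts
```

A question with an empty gold set would be classified as a reader failure, since the empty set is a subset of any set; a real scorer may want to reject such rows first. The returned dict could then be merged into scores.json.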

Edge cases

  • Smart-search called by humans via the viewer — viewer queries shouldn't count as reader failures (no agent involved). Tag viewer-originated searches with header X-Agentmemory-Source: viewer and skip the metric.
  • Single-session re-query is normal — refining "auth flow" → "auth token expiry" is research behavior. The proxy will overcount. Document this in the metric's help-text.
  • Followup window tuning — 30s default. Configurable. Document that long values overcount, short values undercount.
  • Cold start — fresh mem:recent-searches scope, no prior search. First call is never a followup. Correct.
  • Storage retention: mem:recent-searches only needs the last entry per session. Expire entries via a cron TTL sweep that drops anything older than 24h.
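
Two of these edge cases, the viewer exclusion and the TTL sweep, can be sketched as follows. The function names are hypothetical, the store is a plain dict of session_id -> (timestamp, result ids) rather than the real KV scope, and treating the header name case-insensitively is an assumption.

```python
VIEWER_HEADER = "X-Agentmemory-Source"
RETENTION_SECONDS = 24 * 60 * 60  # 24h retention from the issue


def is_viewer_search(headers):
    """True when the request carries X-Agentmemory-Source: viewer."""
    for name, value in headers.items():
        if name.lower() == VIEWER_HEADER.lower():
            return value.strip().lower() == "viewer"
    return False


def sweep_recent_searches(recent, now, retention_seconds=RETENTION_SECONDS):
    """Delete entries older than the retention period; return how many."""
    expired = [
        session_id
        for session_id, (timestamp, _ids) in recent.items()
        if now - timestamp > retention_seconds
    ]
    for session_id in expired:
        del recent[session_id]
    return len(expired)
```

A search handler would check is_viewer_search first and skip the followup bookkeeping entirely when it returns True.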

Acceptance

  • OTEL counters registered
  • mem::diagnostic::record-followup function + wired into mem::smart-search
  • mem:recent-searches KV scope + cron TTL sweep
  • Viewer searches tagged with header + excluded
  • Benchmark scorer differentiates retrieval-fail vs reader-fail-with-evidence
  • scores.json separately reports the two failure modes
  • agentmemory status surfaces followup rate (with caveat in help text)
  • AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS configurable
  • Tests: positive followup detection, viewer-excluded path, TTL sweep

Why it matters

Disambiguates retrieval bugs from reader bugs. Today we can only measure "user complained that recall didn't work" which is too coarse. Both metrics let us see whether a change to ranking helped retrieval (followup rate down) or just shifted which questions hit the reader-failure floor. Pairs with benchmark harness — without this, benchmark wins might be hollow if they're just moving questions between failure modes.
