feat(diag): reader-failure-with-evidence counter + smart-search followup-rate proxy

## Problem

We don't measure how often agentmemory injects the right memory but the agent fails to use it. Without this metric, retrieval failures and reader failures are indistinguishable: every retrieval-quality regression looks identical to a reader-behavior regression.

There's a category of failure where retrieval is correct (the answer-bearing memory is in the injected context block) but the agent's response doesn't reflect it. Possible causes: the context block was too long, ranking buried the relevant entry, or the agent's prompt didn't lean on the memory. Today this failure mode is invisible.

## Proposed shape

New diagnostic counter: `reader_failure_with_evidence` — counts cases where the agent's response misses the answer despite the answer-bearing memory being in the retrieved context.

This requires:
- A ground-truth signal for "what answer was correct" → only available in the benchmark harness, not in live use
- A judgment of "did the agent's response use the memory" → judge model evaluation

In live use, a weaker proxy is possible: count cases where the agent calls smart-search, receives results, then calls smart-search again with a different query within N seconds. That's a heuristic signal for "first results didn't satisfy."

iii composition:
- New OTEL counter `agentmemory.smart_search.followup_within_window_total` (live-use proxy)
- New OTEL counter `agentmemory.reader_failure_with_evidence_total` (benchmark-mode only)
- iii function `mem::diagnostic::record-followup` increments the proxy counter when a session issues a second search within `AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS` (default 30s) of its previous one and the heuristic below matches
- Benchmark harness emits `reader_failure_with_evidence` from its scorer when it detects the gold-memory-in-context-but-wrong-answer pattern
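
A minimal sketch of how the window setting might be resolved, assuming the value arrives as a plain string from the environment. The helper name `followupWindowMs` and the fallback-on-invalid behavior are illustrative assumptions, not part of the proposal:

```typescript
// Hypothetical helper: resolves the followup window in milliseconds.
// Only the variable name and the 30s default come from the issue text.
const DEFAULT_FOLLOWUP_WINDOW_SECONDS = 30;

function followupWindowMs(env: Record<string, string | undefined>): number {
  const raw = env["AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS"];
  const seconds = raw === undefined ? NaN : Number(raw);
  // Assumption: missing, non-numeric, or non-positive values fall back to the default.
  if (!Number.isFinite(seconds) || seconds <= 0) {
    return DEFAULT_FOLLOWUP_WINDOW_SECONDS * 1000;
  }
  return seconds * 1000;
}
```

In a Node process this would be called as `followupWindowMs(process.env)`; taking the environment as a parameter keeps the helper easy to test.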

## Heuristic for live-use proxy

When `mem::smart-search` is called:
- Look up the `mem:recent-searches` scope keyed by `sessionId` for the most recent search entry (timestamp and result IDs)
- If an entry is found, it is within the window, AND its result-set has zero overlap with the current result-set (different queries returning different docs), increment `followup_within_window_total`
- Write current call to `mem:recent-searches`
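
The steps above can be sketched as follows. This is an illustration under stated assumptions: an in-memory `Map` stands in for the `mem:recent-searches` scope, a plain number stands in for the OTEL counter, and the `recordFollowup` signature and `RecentSearch` shape are hypothetical.

```typescript
// Hypothetical entry shape; the issue only says the scope is keyed by sessionId.
interface RecentSearch {
  timestampMs: number;
  resultIds: string[];
}

// In-memory stand-ins for the `mem:recent-searches` scope and the OTEL counter.
const recentSearches = new Map<string, RecentSearch>();
let followupWithinWindowTotal = 0;

function recordFollowup(
  sessionId: string,
  resultIds: string[],
  nowMs: number,
  windowMs: number = 30000,
): boolean {
  const previous = recentSearches.get(sessionId);
  let isFollowup = false;

  // Followup = a prior search exists, it is within the window,
  // and the two result sets share no document IDs.
  if (previous !== undefined && nowMs - previous.timestampMs <= windowMs) {
    const previousIds = new Set(previous.resultIds);
    const overlaps = resultIds.some((id) => previousIds.has(id));
    isFollowup = !overlaps;
  }

  if (isFollowup) followupWithinWindowTotal += 1;

  // Always overwrite: only the latest search per session is needed.
  recentSearches.set(sessionId, { timestampMs: nowMs, resultIds });
  return isFollowup;
}
```

Passing `nowMs` explicitly instead of reading the clock inside the function makes the window check deterministic in tests.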

This proxy will overcount (a user genuinely refining their query is not a reader failure) and will also undercount (a reader failure where the agent never searches again leaves no followup to detect). Treat it as a directional signal, not an absolute measure.

## Benchmark-mode metric

In benchmark harness, after judge scoring, walk each question:
- If `judge_correct == false` AND gold-evidence-IDs ⊆ retrieved-context-IDs → increment `reader_failure_with_evidence`
- Else if `judge_correct == false` AND gold-evidence-IDs ⊈ retrieved-context-IDs → that's retrieval-failure, NOT reader-failure
- Report both numbers in `scores.json`
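
A minimal sketch of the classification, assuming the scorer has per-question gold evidence IDs and retrieved context IDs available as string arrays. The type and function names here are hypothetical, and the actual layout of `scores.json` is not specified in this issue:

```typescript
type FailureMode = "correct" | "reader_failure_with_evidence" | "retrieval_failure";

// Hypothetical per-question record produced after judge scoring.
interface ScoredQuestion {
  judgeCorrect: boolean;
  goldEvidenceIds: string[];
  retrievedContextIds: string[];
}

function classify(q: ScoredQuestion): FailureMode {
  if (q.judgeCorrect) return "correct";
  const retrieved = new Set(q.retrievedContextIds);
  // Subset check: every gold ID must appear in the retrieved context.
  // Note: an empty gold set passes vacuously, so such questions need separate handling.
  const goldInContext = q.goldEvidenceIds.every((id) => retrieved.has(id));
  return goldInContext ? "reader_failure_with_evidence" : "retrieval_failure";
}

function summarize(questions: ScoredQuestion[]): Record<FailureMode, number> {
  const counts: Record<FailureMode, number> = {
    correct: 0,
    reader_failure_with_evidence: 0,
    retrieval_failure: 0,
  };
  for (const q of questions) counts[classify(q)] += 1;
  return counts;
}
```

The two failure counts from `summarize` are the numbers that would be reported separately.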

## Edge cases

- **Smart-search called by humans via the viewer** — viewer queries shouldn't count as reader failures (no agent involved). Tag viewer-originated searches with header `X-Agentmemory-Source: viewer` and skip the metric.
- **Single-session re-query is normal** — refining "auth flow" → "auth token expiry" is research behavior. The proxy will overcount. Document this in the metric's help-text.
- **Followup window tuning** — 30s default. Configurable. Document that long values overcount, short values undercount.
- **Cold start** — fresh `mem:recent-searches` scope, no prior search. First call is never a followup. Correct.
- **Storage retention** — `mem:recent-searches` only needs the last entry per session. Expire entries older than 24h via a cron TTL sweep.
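
Two of these edge cases, viewer exclusion and the TTL sweep, can be sketched as below. This assumes request headers are available as a plain object and that the scope behaves like a `Map`; the function names and the `RecentSearchEntry` shape are hypothetical.

```typescript
// Hypothetical entry shape; only the timestamp matters for the sweep.
interface RecentSearchEntry {
  timestampMs: number;
}

const RETENTION_MS = 24 * 60 * 60 * 1000; // 24h retention from the issue text

// Returns false for viewer-originated searches so they skip the metric.
function shouldCountSearch(headers: Record<string, string | undefined>): boolean {
  // HTTP header names are case-insensitive, so compare lowercased names.
  for (const name of Object.keys(headers)) {
    if (name.toLowerCase() === "x-agentmemory-source" && headers[name] === "viewer") {
      return false;
    }
  }
  return true;
}

// Deletes entries older than the retention period; returns how many were removed.
function sweepExpired(store: Map<string, RecentSearchEntry>, nowMs: number): number {
  let removed = 0;
  store.forEach((entry, sessionId) => {
    if (nowMs - entry.timestampMs > RETENTION_MS) {
      store.delete(sessionId);
      removed += 1;
    }
  });
  return removed;
}
```

Because the followup window is far shorter than 24h, a sweep at this retention never removes an entry the heuristic could still match.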

## Acceptance

- [ ] OTEL counters registered
- [ ] `mem::diagnostic::record-followup` function + wired into `mem::smart-search`
- [ ] `mem:recent-searches` KV scope + cron TTL sweep
- [ ] Viewer searches tagged with header + excluded
- [ ] Benchmark scorer differentiates retrieval-fail vs reader-fail-with-evidence
- [ ] `scores.json` separately reports the two failure modes
- [ ] `agentmemory status` surfaces followup rate (with caveat in help text)
- [ ] `AGENTMEMORY_FOLLOWUP_WINDOW_SECONDS` configurable
- [ ] Tests: positive followup detection, viewer-excluded path, TTL sweep

## Why it matters

Disambiguates retrieval bugs from reader bugs. Today we can only measure "user complained that recall didn't work", which is too coarse. Together, the two metrics let us see whether a change to ranking helped retrieval (followup rate down) or just shifted which questions hit the reader-failure floor. This pairs with the benchmark harness: without it, benchmark wins might be hollow if they're just moving questions between failure modes.

