Self-improving AI agent with an eval → feedback → update loop. Inspired by Karpathy's autoresearch and the SICA paper.
An agent that gets better at its job automatically. Run evals, detect failures, generate improved prompts, re-eval, and accept only what improves metrics. The full loop runs from a single "Improve" button in the UI.
User Input → Agent (LangGraph) → Output → Eval Layer → Failure Detection
↓
Test Case Generation
↓
Prompt Update (LLM-generated)
↓
Re-eval → Accept / Reject
# Clone
git clone https://github.com/maxpetrusenko/AdaptiveAgent.git
cd AdaptiveAgent
# Backend
cd backend
pip install -e ".[dev]"
cp .env.example .env # add OPENAI_API_KEY or ANTHROPIC_API_KEY
uvicorn app.main:app --reload # http://localhost:8000
# Frontend (new terminal)
cd frontend
pnpm install
pnpm dev # http://localhost:3737

Open http://localhost:3737. The database and 10 seed eval cases are created automatically on first run.
What the current benchmark story proves:
- the adaptive loop improves a weak starting prompt when there is a real failure signal
- accepted adaptations create a new active prompt only after measured improvement
- saturated suites stay stable instead of forcing pointless prompt churn
- the adaptive agent can catch up to strong tool-using baselines on the smoke suite
What it does not prove yet:
- state-of-the-art performance against external agent products
- broad statistical significance across large public benchmarks
- code-level self-modification beyond prompt updates
Two benchmark modes:
- single-system adaptation benchmark
- comparative leaderboard against baselines
Single-system:
cd backend
python -m app.benchmarks.run --repeats 3 --out benchmark-results/latest.json

Comparative leaderboard:
cd backend
python -m app.benchmarks.compare --out benchmark-results/compare.json

Human-readable storyboard:
python -m app.benchmarks.report_html --dir benchmark-results
open benchmark-results/index.html

If the seeded suite is already at 100% and you want to prove the loop can recover from a weak prompt, run the stress benchmark:
python -m app.benchmarks.run \
--stress-baseline tool-agnostic \
--case-tag tool-use \
--repeats 1 \
--consistency-repeats 0 \
--out benchmark-results/stress-tool-use.json

The report includes:
- baseline mean/std pass rate
- post-adaptation mean/std pass rate
- accepted/rejected adaptation decision
- whether the active prompt version changed
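If you want those numbers programmatically rather than by eyeballing the JSON, something like the sketch below works. The key names are illustrative guesses, not a documented schema; check the generated file for the real fields.

```python
import json
from pathlib import Path

report = json.loads(Path("benchmark-results/latest.json").read_text())

# Key names below are illustrative; inspect the file for the actual schema.
print("baseline pass rate:", report.get("baseline_pass_rate_mean"))
print("post-adaptation pass rate:", report.get("adapted_pass_rate_mean"))
print("adaptation accepted:", report.get("accepted"))
print("prompt version changed:", report.get("prompt_version_changed"))
```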
The comparative report includes:
- direct_llm vs weak_static_agent vs adaptive_agent vs seed_tool_agent vs sdk_tool_agent
- 8 train cases and 42 held-out eval cases across tool use, reasoning, factual recall, safety, uncertainty, privacy, retrieval, prompt-injection, and multi-turn behavior
- average latency
- hallucination failures
- pairwise win/loss/tie deltas against adaptive_agent
- judge calibration on 56 labeled cases
- adversarial null-agent and judge-bias checks
See docs/runbooks/benchmarking.md for interpretation.
The core loop follows the Karpathy autoresearch pattern — only accept changes that measurably improve performance:
- Eval — Run all test cases against the current agent prompt
- Detect — LLM-as-judge identifies failures (wrong answers, hallucinations)
- Generate — LLM analyzes failure patterns and writes an improved system prompt
- Re-eval — Run the same test suite with the new prompt
- Accept/Reject — If pass rate improved → keep new prompt. If not → revert.
Every prompt version is stored with full history and rollback capability.
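A minimal sketch of that accept/reject gate, assuming injected helpers for the eval runner and the prompt updater; this is an illustration, not the actual app/adapt/loop.py code.

```python
from typing import Callable

def improve_once(
    prompt: str,
    cases: list[dict],
    run_suite: Callable[[str, list[dict]], list[bool]],   # per-case pass/fail
    propose_prompt: Callable[[str, list[dict]], str],     # LLM writes a new prompt
) -> tuple[str, bool]:
    """One eval -> detect -> generate -> re-eval -> accept/reject cycle."""
    baseline = run_suite(prompt, cases)                        # Eval
    failed = [c for c, ok in zip(cases, baseline) if not ok]   # Detect
    if not failed:
        return prompt, False                                   # saturated suite: no churn

    candidate = propose_prompt(prompt, failed)                 # Generate
    retest = run_suite(candidate, cases)                       # Re-eval on the same suite

    if sum(retest) / len(retest) > sum(baseline) / len(baseline):
        return candidate, True                                 # Accept: measured improvement
    return prompt, False                                       # Reject: never keep a regression
```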
- Chat — Streaming conversations with tool use (calculator, time) via LangGraph
- Evals — Run evaluation suites with pass/fail, hallucination detection, and consistency checks
- Cases — 10 seed test cases + create your own + auto-generate from failures
- Adaptation — One-click self-improvement loop with before/after diff view
- Dashboard — Live metrics: pass rate, hallucination rate, cost, trends over time
frontend/
├── Next.js 16 (App Router)
├── shadcn/ui + Tailwind
├── Recharts for metrics
├── SSE streaming
└── 5 pages, 12 components

backend/
├── FastAPI + SQLAlchemy + SQLite
├── LangGraph agent with tools
├── LLM-as-judge evaluation
├── Prompt versioning + rollback
└── Self-improving loop orchestrator
| Layer | Tech |
|---|---|
| Frontend | Next.js 16, Tailwind CSS, shadcn/ui, Recharts, react-markdown |
| Backend | Python 3.11, FastAPI, LangGraph, SQLAlchemy, SQLite |
| Agent | OpenAI / Anthropic / OpenAI-compatible local proxy, tool calling, SSE streaming |
| Eval | deterministic checks first, LLM-as-judge fallback, hallucination detection, consistency checks |
| Testing | Vitest (frontend), pytest (backend), 45 backend tests currently |
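The Eval row above is a cascade: cheap deterministic checks decide what they can, and only undecided cases go to the judge model. A rough sketch of that ordering with illustrative function names, not the actual app/eval/checks.py API:

```python
import re

def deterministic_check(output: str, expected: str | None) -> bool | None:
    """True/False when a cheap rule can decide, None when it cannot."""
    if expected is None:
        return None                                  # open-ended case: judge decides
    if expected.strip().lower() in output.strip().lower():
        return True                                  # expected answer literally present
    if re.fullmatch(r"-?\d+(\.\d+)?", expected.strip()):
        return False                                 # numeric answer missing: hard fail
    return None                                      # qualitative mismatch: defer to judge

def grade(output: str, expected: str | None, judge) -> bool:
    verdict = deterministic_check(output, expected)
    if verdict is not None:
        return verdict                               # no LLM call needed
    return judge(output, expected)                   # LLM-as-judge handles the rest
```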
backend/
├── app/agent/graph.py # LangGraph agent definition
├── app/agent/prompts.py # System prompt (v1 seed)
├── app/eval/runner.py # Eval execution engine
├── app/eval/checks.py # Pass/fail, hallucination, consistency
├── app/adapt/loop.py # Self-improving loop orchestrator
├── app/adapt/prompt_updater.py # LLM-based prompt improvement
├── app/memory/store.py # Failure storage
├── app/memory/cases.py # Failure → test case conversion
├── app/models.py # All SQLAlchemy models
├── app/seed.py # 10 seed eval cases + prompt v1
└── app/api/ # REST endpoints (chat, evals, cases, adapt, dashboard)
frontend/
├── src/app/page.tsx # Dashboard with live metrics
├── src/app/chat/page.tsx # Chat interface with SSE streaming
├── src/app/evals/page.tsx # Eval runs + results + charts
├── src/app/cases/page.tsx # Test case management
├── src/app/adapt/page.tsx # Adaptation history + prompt diff
├── src/hooks/use-chat.ts # Chat state + streaming hook
└── src/components/ # Chat, evals, cases, adapt, layout
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /api/chat/sessions | Create chat session |
| GET | /api/chat/sessions | List sessions |
| POST | /api/chat/stream | Stream agent response (SSE) |
| GET | /api/cases | List eval test cases |
| POST | /api/cases | Create test case |
| POST | /api/evals/run | Trigger eval run |
| GET | /api/evals/runs | List eval runs |
| GET | /api/evals/runs/:id/results | Get eval results |
| POST | /api/adapt/improve | Trigger self-improving loop |
| GET | /api/adapt/runs | List adaptation runs |
| GET | /api/adapt/runs/:id | Adaptation detail + prompt diff |
| GET | /api/adapt/prompts | List prompt versions |
| GET | /api/dashboard/metrics | Dashboard metrics |
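To exercise the loop from outside the UI, the endpoints above can be hit with any HTTP client. A sketch using httpx; the request bodies and response fields shown here are assumptions, not a documented schema.

```python
import httpx

BASE = "http://localhost:8000"

with httpx.Client(base_url=BASE, timeout=None) as client:
    # Some POST routes may expect a JSON body (e.g. a prompt version id);
    # the empty bodies here are assumptions.
    client.post("/api/evals/run")
    improve = client.post("/api/adapt/improve")
    print(improve.status_code, improve.json())

    # Inspect adaptation history and prompt versions.
    print(client.get("/api/adapt/runs").json())
    print(client.get("/api/adapt/prompts").json())
```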
Session → Messages (chat history)
PromptVersion → versioned system prompts with parent chain
EvalCase → test inputs + expected outputs + tags
EvalRun → execution of all cases against a prompt version
EvalResult → per-case pass/fail + score + latency
AdaptationRun → before/after prompt versions + pass rates + accepted?
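As one illustration of the PromptVersion parent chain in the list above, a stripped-down SQLAlchemy model might look like the following; the column names are assumptions, not the actual app/models.py definitions.

```python
from sqlalchemy import Boolean, Column, ForeignKey, Integer, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class PromptVersion(Base):
    """Illustrative version of the prompt-history model (not app/models.py verbatim)."""
    __tablename__ = "prompt_versions"

    id = Column(Integer, primary_key=True)
    content = Column(Text, nullable=False)                   # the system prompt text
    parent_id = Column(Integer, ForeignKey("prompt_versions.id"), nullable=True)
    is_active = Column(Boolean, default=False)                # only one active version

    parent = relationship("PromptVersion", remote_side=[id])  # rollback walks this chain
```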
- SSE over WebSocket — simpler, HTTP/2 compatible, matches Anthropic's streaming API
- SQLite — zero-config for MVP, single file, easy to inspect with DB Browser
- LLM-as-judge fallback — deterministic checks first; configured judge model handles qualitative checks
- Accept/reject gate — autoresearch pattern: never deploy a regression
- Prompt versioning — every change tracked, full rollback, diff view in UI
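Because streaming is plain SSE, the chat route can be an ordinary FastAPI StreamingResponse rather than a WebSocket handler. A simplified sketch of the idea; the real endpoint lives under app/api/ and streams LangGraph tokens instead of a fixed string.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_stream(message: str):
    # In the real app this iterates the agent's streamed tokens.
    for token in message.split():
        yield f"data: {token}\n\n"           # SSE frame: "data: ..." plus a blank line
    yield "data: [DONE]\n\n"

@app.post("/api/chat/stream")
async def chat_stream(payload: dict):
    text = payload.get("message", "")
    return StreamingResponse(token_stream(text), media_type="text/event-stream")
```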
Built on ideas from:
- Karpathy's autoresearch — fixed-budget modify/run/eval/accept loop
- SICA: Self-Improving Coding Agent — agent edits its own scaffolding
- GVU Framework — Generator-Verifier-Updater unifies all self-improvement methods
- LangGraph Reflection Patterns — basic reflection, Reflexion, LATS
- SelfCheckGPT — consistency-based hallucination detection
Key insight: strengthen the verifier, not the generator. If your eval layer is weak, the improvement loop diverges.
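In the SelfCheckGPT spirit, consistency checking samples the same answer several times and treats low agreement as a hallucination signal. A toy illustration of the idea, deliberately simpler than the project's implementation:

```python
def consistency_score(samples: list[str]) -> float:
    """Average token overlap between each sample and the rest (toy metric)."""
    if len(samples) < 2:
        return 1.0
    scores = []
    for i, s in enumerate(samples):
        others = " ".join(samples[:i] + samples[i + 1:]).lower().split()
        tokens = s.lower().split()
        overlap = sum(t in others for t in tokens) / max(len(tokens), 1)
        scores.append(overlap)
    return sum(scores) / len(scores)

# Low agreement across repeated runs is treated as a hallucination signal.
is_suspect = consistency_score(["Paris", "Paris", "Lyon"]) < 0.7
```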
- Add more tools (web search, code interpreter, RAG)
- Implement consistency checking (multi-run variance)
- DSPy-style prompt compilation (MIPROv2 optimizer)
- Fine-tuning path (v2 adaptation beyond prompt updates)
- Playwright e2e tests for full UI flows
- OpenTelemetry tracing for agent observability
MIT