Self-improving AI agent with an eval → feedback → update loop. Inspired by Karpathy's autoresearch and the SICA paper.
An agent that gets better at its job automatically. Run evals, detect failures, generate improved prompts, re-eval, and accept only what improves metrics. The full loop runs from a single "Improve" button in the UI.
User Input → Agent (LangGraph) → Output → Eval Layer → Failure Detection
↓
Test Case Generation
↓
Prompt Update (LLM-generated)
↓
Re-eval → Accept / Reject
# Clone
git clone https://github.com/maxpetrusenko/AdaptiveAgent.git
cd AdaptiveAgent
# Backend
cd backend
pip install -e ".[dev]"
cp .env.example .env # add OPENAI_API_KEY or ANTHROPIC_API_KEY
uvicorn app.main:app --reload # http://localhost:8000
# Frontend (new terminal)
cd frontend
pnpm install
pnpm dev # http://localhost:3737

Open http://localhost:3737. The database and 10 seed eval cases are created automatically on first run.
What the current benchmark story proves:
- the adaptive loop improves a weak starting prompt when there is a real failure signal
- accepted adaptations create a new active prompt only after measured improvement
- saturated suites stay stable instead of forcing pointless prompt churn
- the adaptive agent can catch up to strong tool-using baselines on the smoke suite
What it does not prove yet:
- state-of-the-art performance against external agent products
- broad statistical significance across large public benchmarks
- code-level self-modification beyond prompt updates
Two benchmark modes:
- single-system adaptation benchmark
- comparative leaderboard against baselines
Single-system:
cd backend
python -m app.benchmarks.run --repeats 3 --out benchmark-results/latest.json

Comparative leaderboard:
cd backend
python -m app.benchmarks.compare --out benchmark-results/compare.json

Human-readable storyboard:
python -m app.benchmarks.report_html --dir benchmark-results
open benchmark-results/index.html

If the seeded suite is already at 100% and you want to prove the loop can recover from a weak prompt, run the stress benchmark:
python -m app.benchmarks.run \
--stress-baseline tool-agnostic \
--case-tag tool-use \
--repeats 1 \
--consistency-repeats 0 \
--out benchmark-results/stress-tool-use.json

The report includes:
- baseline mean/std pass rate
- post-adaptation mean/std pass rate
- accepted/rejected adaptation decision
- whether the active prompt version changed
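If you want those numbers programmatically rather than by eyeballing the JSON, something like the sketch below works. The key names are illustrative guesses, not a documented schema; check the generated file for the real fields.

```python
import json
from pathlib import Path

report = json.loads(Path("benchmark-results/latest.json").read_text())

# Key names below are illustrative; inspect the file for the actual schema.
print("baseline pass rate:", report.get("baseline_pass_rate_mean"))
print("post-adaptation pass rate:", report.get("adapted_pass_rate_mean"))
print("adaptation accepted:", report.get("accepted"))
print("prompt version changed:", report.get("prompt_version_changed"))
```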
The comparative report includes:
- direct_llm vs weak_static_agent vs adaptive_agent vs seed_tool_agent vs sdk_tool_agent
- 8 train cases and 42 held-out eval cases across tool use, reasoning, factual recall, safety, uncertainty, privacy, retrieval, prompt-injection, and multi-turn behavior
- average latency
- hallucination failures
- pairwise win/loss/tie deltas against adaptive_agent
- judge calibration on 56 labeled cases
- adversarial null-agent and judge-bias checks
See docs/runbooks/benchmarking.md for interpretation.
The core loop follows the Karpathy autoresearch pattern — only accept changes that measurably improve performance:
- Eval — Run all test cases against the current agent prompt
- Detect — LLM-as-judge identifies failures (wrong answers, hallucinations)
- Generate — LLM analyzes failure patterns and writes an improved system prompt
- Re-eval — Run the same test suite with the new prompt
- Accept/Reject — If pass rate improved → keep new prompt. If not → revert.
Every prompt version is stored with full history and rollback capability.
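A minimal sketch of that accept/reject gate, assuming injected helpers for the eval runner and the prompt updater; this is an illustration, not the actual app/adapt/loop.py code.

```python
from typing import Callable

def improve_once(
    prompt: str,
    cases: list[dict],
    run_suite: Callable[[str, list[dict]], list[bool]],   # per-case pass/fail
    propose_prompt: Callable[[str, list[dict]], str],     # LLM writes a new prompt
) -> tuple[str, bool]:
    """One eval -> detect -> generate -> re-eval -> accept/reject cycle."""
    baseline = run_suite(prompt, cases)                        # Eval
    failed = [c for c, ok in zip(cases, baseline) if not ok]   # Detect
    if not failed:
        return prompt, False                                   # saturated suite: no churn

    candidate = propose_prompt(prompt, failed)                 # Generate
    retest = run_suite(candidate, cases)                       # Re-eval on the same suite

    if sum(retest) / len(retest) > sum(baseline) / len(baseline):
        return candidate, True                                 # Accept: measured improvement
    return prompt, False                                       # Reject: never keep a regression
```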
- Chat — Streaming conversations with tool use (calculator, time) via LangGraph
- Evals — Run evaluation suites with pass/fail, hallucination detection, and consistency checks
- Cases — 10 seed test cases + create your own + auto-generate from failures
- Adaptation — One-click self-improvement loop with before/after diff view
- Dashboard — Live metrics: pass rate, hallucination rate, cost, trends over time
frontend/
├── Next.js 16 (App Router)
├── shadcn/ui + Tailwind
├── Recharts for metrics
├── SSE streaming
└── 5 pages, 12 components

backend/
├── FastAPI + SQLAlchemy + SQLite
├── LangGraph agent with tools
├── LLM-as-judge evaluation
├── Prompt versioning + rollback
└── Self-improving loop orchestrator
| Layer | Tech |
|---|---|
| Frontend | Next.js 16, Tailwind CSS, shadcn/ui, Recharts, react-markdown |
| Backend | Python 3.11, FastAPI, LangGraph, SQLAlchemy, SQLite |
| Agent | OpenAI / Anthropic / OpenAI-compatible local proxy, tool calling, SSE streaming |
| Eval | deterministic checks first, LLM-as-judge fallback, hallucination detection, consistency checks |
| Testing | Vitest (frontend), pytest (backend), 45 backend tests currently |
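The Eval row above is a cascade: cheap deterministic checks decide what they can, and only undecided cases go to the judge model. A rough sketch of that ordering with illustrative function names, not the actual app/eval/checks.py API:

```python
import re

def deterministic_check(output: str, expected: str | None) -> bool | None:
    """True/False when a cheap rule can decide, None when it cannot."""
    if expected is None:
        return None                                  # open-ended case: judge decides
    if expected.strip().lower() in output.strip().lower():
        return True                                  # expected answer literally present
    if re.fullmatch(r"-?\d+(\.\d+)?", expected.strip()):
        return False                                 # numeric answer missing: hard fail
    return None                                      # qualitative mismatch: defer to judge

def grade(output: str, expected: str | None, judge) -> bool:
    verdict = deterministic_check(output, expected)
    if verdict is not None:
        return verdict                               # no LLM call needed
    return judge(output, expected)                   # LLM-as-judge handles the rest
```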
backend/
├── app/agent/graph.py # LangGraph agent definition
├── app/agent/prompts.py # System prompt (v1 seed)
├── app/eval/runner.py # Eval execution engine
├── app/eval/checks.py # Pass/fail, hallucination, consistency
├── app/adapt/loop.py # Self-improving loop orchestrator
├── app/adapt/prompt_updater.py # LLM-based prompt improvement
├── app/memory/store.py # Failure storage
├── app/memory/cases.py # Failure → test case conversion
├── app/models.py # All SQLAlchemy models
├── app/seed.py # 10 seed eval cases + prompt v1
└── app/api/ # REST endpoints (chat, evals, cases, adapt, dashboard)
frontend/
├── src/app/page.tsx # Dashboard with live metrics
├── src/app/chat/page.tsx # Chat interface with SSE streaming
├── src/app/evals/page.tsx # Eval runs + results + charts
├── src/app/cases/page.tsx # Test case management
├── src/app/adapt/page.tsx # Adaptation history + prompt diff
├── src/hooks/use-chat.ts # Chat state + streaming hook
└── src/components/ # Chat, evals, cases, adapt, layout
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /api/chat/sessions | Create chat session |
| GET | /api/chat/sessions | List sessions |
| POST | /api/chat/stream | Stream agent response (SSE) |
| GET | /api/cases | List eval test cases |
| POST | /api/cases | Create test case |
| POST | /api/evals/run | Trigger eval run |
| GET | /api/evals/runs | List eval runs |
| GET | /api/evals/runs/:id/results | Get eval results |
| POST | /api/adapt/improve | Trigger self-improving loop |
| GET | /api/adapt/runs | List adaptation runs |
| GET | /api/adapt/runs/:id | Adaptation detail + prompt diff |
| GET | /api/adapt/prompts | List prompt versions |
| GET | /api/dashboard/metrics | Dashboard metrics |
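To exercise the loop from outside the UI, the endpoints above can be hit with any HTTP client. A sketch using httpx; the request bodies and response fields shown here are assumptions, not a documented schema.

```python
import httpx

BASE = "http://localhost:8000"

with httpx.Client(base_url=BASE, timeout=None) as client:
    # Some POST routes may expect a JSON body (e.g. a prompt version id);
    # the empty bodies here are assumptions.
    client.post("/api/evals/run")
    improve = client.post("/api/adapt/improve")
    print(improve.status_code, improve.json())

    # Inspect adaptation history and prompt versions.
    print(client.get("/api/adapt/runs").json())
    print(client.get("/api/adapt/prompts").json())
```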
Session → Messages (chat history)
PromptVersion → versioned system prompts with parent chain
EvalCase → test inputs + expected outputs + tags
EvalRun → execution of all cases against a prompt version
EvalResult → per-case pass/fail + score + latency
AdaptationRun → before/after prompt versions + pass rates + accepted?
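As one illustration of the PromptVersion parent chain in the list above, a stripped-down SQLAlchemy model might look like the following; the column names are assumptions, not the actual app/models.py definitions.

```python
from sqlalchemy import Boolean, Column, ForeignKey, Integer, Text
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class PromptVersion(Base):
    """Illustrative version of the prompt-history model (not app/models.py verbatim)."""
    __tablename__ = "prompt_versions"

    id = Column(Integer, primary_key=True)
    content = Column(Text, nullable=False)                   # the system prompt text
    parent_id = Column(Integer, ForeignKey("prompt_versions.id"), nullable=True)
    is_active = Column(Boolean, default=False)                # only one active version

    parent = relationship("PromptVersion", remote_side=[id])  # rollback walks this chain
```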
- SSE over WebSocket — simpler, HTTP/2 compatible, matches Anthropic's streaming API
- SQLite — zero-config for MVP, single file, easy to inspect with DB Browser
- LLM-as-judge fallback — deterministic checks first; configured judge model handles qualitative checks
- Accept/reject gate — autoresearch pattern: never deploy a regression
- Prompt versioning — every change tracked, full rollback, diff view in UI
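Because streaming is plain SSE, the chat route can be an ordinary FastAPI StreamingResponse rather than a WebSocket handler. A simplified sketch of the idea; the real endpoint lives under app/api/ and streams LangGraph tokens instead of a fixed string.

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def token_stream(message: str):
    # In the real app this iterates the agent's streamed tokens.
    for token in message.split():
        yield f"data: {token}\n\n"           # SSE frame: "data: ..." plus a blank line
    yield "data: [DONE]\n\n"

@app.post("/api/chat/stream")
async def chat_stream(payload: dict):
    text = payload.get("message", "")
    return StreamingResponse(token_stream(text), media_type="text/event-stream")
```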
Built on ideas from:
- Karpathy's autoresearch — fixed-budget modify/run/eval/accept loop
- SICA: Self-Improving Coding Agent — agent edits its own scaffolding
- GVU Framework — Generator-Verifier-Updater unifies all self-improvement methods
- LangGraph Reflection Patterns — basic reflection, Reflexion, LATS
- SelfCheckGPT — consistency-based hallucination detection
Key insight: strengthen the verifier, not the generator. If your eval layer is weak, the improvement loop diverges.
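In the SelfCheckGPT spirit, consistency checking samples the same answer several times and treats low agreement as a hallucination signal. A toy illustration of the idea, deliberately simpler than the project's implementation:

```python
def consistency_score(samples: list[str]) -> float:
    """Average token overlap between each sample and the rest (toy metric)."""
    if len(samples) < 2:
        return 1.0
    scores = []
    for i, s in enumerate(samples):
        others = " ".join(samples[:i] + samples[i + 1:]).lower().split()
        tokens = s.lower().split()
        overlap = sum(t in others for t in tokens) / max(len(tokens), 1)
        scores.append(overlap)
    return sum(scores) / len(scores)

# Low agreement across repeated runs is treated as a hallucination signal.
is_suspect = consistency_score(["Paris", "Paris", "Lyon"]) < 0.7
```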
- Add more tools (web search, code interpreter, RAG)
- Implement consistency checking (multi-run variance)
- DSPy-style prompt compilation (MIPROv2 optimizer)
- Fine-tuning path (v2 adaptation beyond prompt updates)
- Playwright e2e tests for full UI flows
- OpenTelemetry tracing for agent observability
MIT