
Adaptive Agent


Self-improving AI agent with an eval → feedback → update loop. Inspired by Karpathy's autoresearch and the SICA paper.


An agent that gets better at its job automatically. Run evals, detect failures, generate improved prompts, re-eval, and accept only what improves metrics. The full loop runs from a single "Improve" button in the UI.

User Input → Agent (LangGraph) → Output → Eval Layer → Failure Detection
                                                ↓
                                    Test Case Generation
                                                ↓
                                    Prompt Update (LLM-generated)
                                                ↓
                                    Re-eval → Accept / Reject

Quick Start

# Clone
git clone https://github.com/maxpetrusenko/AdaptiveAgent.git
cd AdaptiveAgent

# Backend
cd backend
pip install -e ".[dev]"
cp .env.example .env          # add OPENAI_API_KEY or ANTHROPIC_API_KEY
uvicorn app.main:app --reload # http://localhost:8000

# Frontend (new terminal)
cd frontend
pnpm install
pnpm dev                      # http://localhost:3737

Open http://localhost:3737. The database and 10 seed eval cases are created automatically on first run.


Benchmark It

What the current benchmark story proves:

  • the adaptive loop improves a weak starting prompt when there is a real failure signal
  • accepted adaptations create a new active prompt only after measured improvement
  • saturated suites stay stable instead of forcing pointless prompt churn
  • the adaptive agent can catch up to strong tool-using baselines on the smoke suite

What it does not prove yet:

  • state-of-the-art performance against external agent products
  • broad statistical significance across large public benchmarks
  • code-level self-modification beyond prompt updates

Two benchmark modes:

  1. single-system adaptation benchmark
  2. comparative leaderboard against baselines

Single-system:

cd backend
python -m app.benchmarks.run --repeats 3 --out benchmark-results/latest.json

Comparative leaderboard:

cd backend
python -m app.benchmarks.compare --out benchmark-results/compare.json

Human-readable storyboard:

python -m app.benchmarks.report_html --dir benchmark-results
open benchmark-results/index.html

If the seeded suite is already at 100% and you want to prove the loop can recover from a weak prompt, run the stress benchmark:

python -m app.benchmarks.run \
  --stress-baseline tool-agnostic \
  --case-tag tool-use \
  --repeats 1 \
  --consistency-repeats 0 \
  --out benchmark-results/stress-tool-use.json

The report includes:

  • baseline mean/std pass rate
  • post-adaptation mean/std pass rate
  • accepted/rejected adaptation decision
  • whether the active prompt version changed
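
For orientation, the headline numbers above can be reproduced from the raw per-repeat pass rates along these lines (illustrative values, stdlib only; the actual aggregation lives in the app.benchmarks.run module):

from statistics import mean, pstdev

baseline_rates = [0.60, 0.70, 0.65]   # pass rate per repeat before adaptation (illustrative)
adapted_rates  = [0.85, 0.80, 0.90]   # pass rate per repeat after adaptation (illustrative)

baseline_mean, baseline_std = mean(baseline_rates), pstdev(baseline_rates)
adapted_mean, adapted_std   = mean(adapted_rates), pstdev(adapted_rates)

# The adaptation is accepted only when the mean pass rate actually improves.
accepted = adapted_mean > baseline_mean
print(f"baseline {baseline_mean:.2f}±{baseline_std:.2f} -> "
      f"adapted {adapted_mean:.2f}±{adapted_std:.2f}, accepted={accepted}")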

The comparative report includes:

  • direct_llm vs weak_static_agent vs adaptive_agent vs seed_tool_agent vs sdk_tool_agent
  • 8 train cases and 42 held-out eval cases across tool use, reasoning, factual recall, safety, uncertainty, privacy, retrieval, prompt-injection, and multi-turn behavior
  • average latency
  • hallucination failures
  • pairwise win/loss/tie deltas against adaptive_agent
  • judge calibration on 56 labeled cases
  • adversarial null-agent and judge-bias checks
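
The pairwise comparison boils down to counting per-case outcomes. A rough sketch of that aggregation (names are illustrative; the real logic lives in the app.benchmarks.compare module):

def pairwise_delta(adaptive: dict[str, bool], other: dict[str, bool]) -> dict[str, int]:
    """Count per-case wins/losses/ties for adaptive_agent against one baseline."""
    wins = losses = ties = 0
    for case_id, adaptive_passed in adaptive.items():
        other_passed = other.get(case_id, False)
        if adaptive_passed and not other_passed:
            wins += 1
        elif other_passed and not adaptive_passed:
            losses += 1
        else:
            ties += 1
    return {"wins": wins, "losses": losses, "ties": ties}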

See docs/runbooks/benchmarking.md for interpretation.

Comparative Benchmark — 5 systems, live on Gemma 4

Benchmark Leaderboard


Screenshots

Dashboard — Monitor agent metrics in real-time
Chat — Streaming conversations with tool use
Evaluations — Run evals, see pass/fail rates and trends
Test Cases — 10 seed cases + create your own
Adaptation — One-click self-improvement with prompt diff view


What It Does

The Self-Improving Loop

The core loop follows the Karpathy autoresearch pattern — only accept changes that measurably improve performance:

  1. Eval — Run all test cases against the current agent prompt
  2. Detect — LLM-as-judge identifies failures (wrong answers, hallucinations)
  3. Generate — LLM analyzes failure patterns and writes an improved system prompt
  4. Re-eval — Run the same test suite with the new prompt
  5. Accept/Reject — If pass rate improved → keep new prompt. If not → revert.

Every prompt version is stored with full history and rollback capability.
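
A minimal sketch of that accept/reject cycle, using hypothetical helper names (the real orchestrator is backend/app/adapt/loop.py):

def improve_once(agent, cases, judge, propose_prompt, store):
    """One eval -> detect -> generate -> re-eval -> accept/reject cycle (illustrative)."""
    current = store.active_prompt()

    # 1-2. Eval every case with the current prompt; the judge marks failures.
    baseline = [judge(case, agent.run(case.input, prompt=current)) for case in cases]
    failures = [r for r in baseline if not r.passed]
    if not failures:
        return current  # saturated suite: no failure signal, no prompt churn

    # 3. Ask an LLM to write an improved system prompt from the failure patterns.
    candidate = propose_prompt(current, failures)

    # 4. Re-run the same suite with the candidate prompt.
    rerun = [judge(case, agent.run(case.input, prompt=candidate)) for case in cases]

    # 5. Accept only if the pass rate improved; otherwise keep the old prompt.
    baseline_rate = sum(r.passed for r in baseline) / len(cases)
    new_rate = sum(r.passed for r in rerun) / len(cases)
    if new_rate > baseline_rate:
        store.save_version(candidate, parent=current, activate=True)
        return candidate
    return current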

Key Features

  • Chat — Streaming conversations with tool use (calculator, time) via LangGraph
  • Evals — Run evaluation suites with pass/fail, hallucination detection, and consistency checks
  • Cases — 10 seed test cases + create your own + auto-generate from failures
  • Adaptation — One-click self-improvement loop with before/after diff view
  • Dashboard — Live metrics: pass rate, hallucination rate, cost, trends over time

Architecture

frontend/                       backend/
├── Next.js 16 (App Router)     ├── FastAPI + SQLAlchemy + SQLite
├── shadcn/ui + Tailwind        ├── LangGraph agent with tools
├── Recharts for metrics        ├── LLM-as-judge evaluation
├── SSE streaming               ├── Prompt versioning + rollback
└── 5 pages, 12 components      └── Self-improving loop orchestrator
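
For a rough idea of the agent side, the tool-using graph could be wired with LangGraph's prebuilt ReAct helper roughly as below (a sketch with assumed model and tool names; the prompt argument varies by LangGraph version, and the actual graph lives in backend/app/agent/graph.py):

from datetime import datetime, timezone

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression."""
    return str(eval(expression, {"__builtins__": {}}, {}))  # demo-grade only

@tool
def current_time() -> str:
    """Return the current UTC time in ISO 8601 format."""
    return datetime.now(timezone.utc).isoformat()

agent = create_react_agent(
    ChatOpenAI(model="gpt-4o-mini"),        # any tool-calling model works here
    tools=[calculator, current_time],
    prompt="You are a helpful assistant.",  # swapped for the active PromptVersion at runtime
)
result = agent.invoke({"messages": [("user", "What is 17 * 23?")]})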

Tech Stack

Layer     Tech
Frontend  Next.js 16, Tailwind CSS, shadcn/ui, Recharts, react-markdown
Backend   Python 3.11, FastAPI, LangGraph, SQLAlchemy, SQLite
Agent     OpenAI / Anthropic / OpenAI-compatible local proxy, tool calling, SSE streaming
Eval      Deterministic checks first, LLM-as-judge fallback, hallucination detection, consistency checks
Testing   Vitest (frontend), pytest (backend), 45 backend tests currently
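
The "deterministic checks first, LLM-as-judge fallback" order in the Eval row could look roughly like this (field and helper names are assumptions; the real checks live in backend/app/eval/checks.py):

def evaluate_case(case, output: str, judge_llm) -> bool:
    """Cheap deterministic checks run first; a judge model handles qualitative cases."""
    if getattr(case, "expected_exact", None) is not None:
        return output.strip() == case.expected_exact.strip()
    if getattr(case, "expected_substring", None) is not None:
        return case.expected_substring.lower() in output.lower()

    # Fallback: ask the configured judge model for a PASS/FAIL verdict.
    verdict = judge_llm(
        f"Question: {case.input}\nExpected: {case.expected}\nAnswer: {output}\n"
        "Reply with PASS or FAIL."
    )
    return verdict.strip().upper().startswith("PASS")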

Key Paths

backend/
├── app/agent/graph.py          # LangGraph agent definition
├── app/agent/prompts.py        # System prompt (v1 seed)
├── app/eval/runner.py          # Eval execution engine
├── app/eval/checks.py          # Pass/fail, hallucination, consistency
├── app/adapt/loop.py           # Self-improving loop orchestrator
├── app/adapt/prompt_updater.py # LLM-based prompt improvement
├── app/memory/store.py         # Failure storage
├── app/memory/cases.py         # Failure → test case conversion
├── app/models.py               # All SQLAlchemy models
├── app/seed.py                 # 10 seed eval cases + prompt v1
└── app/api/                    # REST endpoints (chat, evals, cases, adapt, dashboard)

frontend/
├── src/app/page.tsx            # Dashboard with live metrics
├── src/app/chat/page.tsx       # Chat interface with SSE streaming
├── src/app/evals/page.tsx      # Eval runs + results + charts
├── src/app/cases/page.tsx      # Test case management
├── src/app/adapt/page.tsx      # Adaptation history + prompt diff
├── src/hooks/use-chat.ts       # Chat state + streaming hook
└── src/components/             # Chat, evals, cases, adapt, layout

API Endpoints

Method  Path                             Description
GET     /health                          Health check
POST    /api/chat/sessions               Create chat session
GET     /api/chat/sessions               List sessions
POST    /api/chat/stream                 Stream agent response (SSE)
GET     /api/cases                       List eval test cases
POST    /api/cases                       Create test case
POST    /api/evals/run                   Trigger eval run
GET     /api/evals/runs                  List eval runs
GET     /api/evals/runs/:id/results      Get eval results
POST    /api/adapt/improve               Trigger self-improving loop
GET     /api/adapt/runs                  List adaptation runs
GET     /api/adapt/runs/:id              Adaptation detail + prompt diff
GET     /api/adapt/prompts               List prompt versions
GET     /api/dashboard/metrics           Dashboard metrics
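
As a quick smoke test of the adaptation endpoints, something like the following should work once the backend is running (response shapes here are assumptions, not a documented schema):

import requests

BASE = "http://localhost:8000"

# Trigger the self-improving loop, the same action as the "Improve" button in the UI.
requests.post(f"{BASE}/api/adapt/improve").raise_for_status()

# Inspect adaptation runs and the stored prompt versions.
runs = requests.get(f"{BASE}/api/adapt/runs").json()
prompts = requests.get(f"{BASE}/api/adapt/prompts").json()
print(runs[0] if runs else "no adaptation runs yet")
print(f"{len(prompts)} prompt version(s) stored")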

Data Models

Session          → Messages (chat history)
PromptVersion    → versioned system prompts with parent chain
EvalCase         → test inputs + expected outputs + tags
EvalRun          → execution of all cases against a prompt version
EvalResult       → per-case pass/fail + score + latency
AdaptationRun    → before/after prompt versions + pass rates + accepted?
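
As a shape reference, two of those models might look like this in SQLAlchemy 2.0 style (column names are assumptions; the real definitions are in backend/app/models.py):

from datetime import datetime

from sqlalchemy import Boolean, DateTime, Float, ForeignKey, Integer, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class PromptVersion(Base):
    __tablename__ = "prompt_versions"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    content: Mapped[str] = mapped_column(Text)
    parent_id: Mapped[int | None] = mapped_column(ForeignKey("prompt_versions.id"))  # parent chain
    is_active: Mapped[bool] = mapped_column(Boolean, default=False)
    created_at: Mapped[datetime] = mapped_column(DateTime, default=datetime.utcnow)

class AdaptationRun(Base):
    __tablename__ = "adaptation_runs"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    before_prompt_id: Mapped[int] = mapped_column(ForeignKey("prompt_versions.id"))
    after_prompt_id: Mapped[int] = mapped_column(ForeignKey("prompt_versions.id"))
    baseline_pass_rate: Mapped[float] = mapped_column(Float)
    new_pass_rate: Mapped[float] = mapped_column(Float)
    accepted: Mapped[bool] = mapped_column(Boolean)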

Design Decisions

  • SSE over WebSocket — simpler, HTTP/2 compatible, matches Anthropic's streaming API
  • SQLite — zero-config for MVP, single file, easy to inspect with DB Browser
  • LLM-as-judge fallback — deterministic checks first; a configured judge model handles qualitative checks
  • Accept/reject gate — autoresearch pattern: never deploy a regression
  • Prompt versioning — every change tracked, full rollback, diff view in UI

Research References

Built on ideas from:

  • Karpathy's autoresearch pattern (accept only changes that measurably improve performance)
  • the SICA paper

Key insight: strengthen the verifier, not the generator. If your eval layer is weak, the improvement loop diverges.


Next Steps

  • Add more tools (web search, code interpreter, RAG)
  • Implement consistency checking (multi-run variance)
  • DSPy-style prompt compilation (MIPROv2 optimizer)
  • Fine-tuning path (v2 adaptation beyond prompt updates)
  • Playwright e2e tests for full UI flows
  • OpenTelemetry tracing for agent observability

License

MIT
