This file provides persistent context for AI agents working on this repository.
- Language:
- Framework:
- Database:
- Testing:
- Keep functions small and focused
- Write self-documenting code with clear names
- Add comments only for "why", not "what"
- Follow existing patterns in the codebase
- Format command: <!-- e.g., make format, npm run format -->
- Lint command: <!-- e.g., make lint, npm run lint -->
- Type check: <!-- e.g., make typecheck, npm run typecheck -->
- Test command: <!-- e.g., make test, npm test -->
- Coverage requirement:
- Test location: <!-- e.g., tests/, __tests__/ -->
<!-- Customize this structure for your project -->
.
├── src/ # Source code
├── tests/ # Test files
├── docs/ # Documentation
├── configs/ # Configuration files
├── scripts/ # Utility scripts
├── AGENTS.md # This file
├── WORKFLOW.md # OpenSymphony configuration
└── README.md # Project readme
| Variable | Description | Required |
|---|---|---|
| EXAMPLE_VAR | | Yes/No |
```shell
# Example setup steps
# 1. Install dependencies
# 2. Configure environment
# 3. Run tests
```

Before submitting a PR:
- All tests pass
- Code is formatted
- Lint checks pass
- New code has tests
- Documentation updated if needed
- Context:
- Decision:
- Consequences:
The following content was preserved from the repository's previous AGENTS.md during opensymphony init.
Build a harness-agnostic benchmarking system that compares providers, models, harnesses, and harness configurations by routing all benchmark traffic through a local LiteLLM proxy and storing normalized benchmark records in a project-owned database.
The system is designed for interactive benchmark sessions. The project does not run a harness-specific automation engine in the core path. It registers sessions, issues proxy credentials, renders harness environment snippets, ingests telemetry, normalizes data, and serves reports.
- LiteLLM is the single inference gateway.
- Every benchmark session must have a benchmark-owned `session_id`.
- Every session must be correlated to LiteLLM traffic through a session-scoped proxy credential and benchmark tags.
- The benchmark database is the source of truth for reporting.
- LiteLLM tables, logs, and Prometheus metrics are source inputs, not the reporting schema.
- Prompt and response content are off by default.
- The core code path must remain harness-agnostic.
- Session creation must capture benchmark metadata before any harness traffic starts.
- Collection and normalization jobs must be idempotent.
- Any change that weakens correlation, reproducibility, or redaction is a design bug.
Build and maintain:
- infrastructure config for local LiteLLM, Postgres, Prometheus, and Grafana
- typed config schemas for providers, harness profiles, variants, experiments, and task cards
- benchmark session registry and lifecycle services
- session credential issuance and harness env rendering
- LiteLLM request collection and normalization
- Prometheus metric collection and rollups
- comparison queries, exports, and dashboards
- security controls around secrets, retention, and redaction
Do not build product-specific logic into the core path. If two harnesses need different environment variable names, solve that through harness profiles and rendering templates, not bespoke control flow.
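For example, differing environment variable names can be handled entirely by data. The sketch below is illustrative, not the project's actual schema: the profile shapes, field names, and env var mappings are assumptions made up for this example.

```python
# Hypothetical sketch: resolving harness-specific env var names through a
# harness profile's rendering template instead of bespoke control flow.
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessProfile:
    name: str
    # Maps this harness's env var names to canonical proxy settings.
    env_template: dict[str, str]


def render_env(profile: HarnessProfile, base_url: str, api_key: str) -> dict[str, str]:
    """Render a harness env snippet from canonical proxy settings."""
    values = {"base_url": base_url, "api_key": api_key}
    return {var: values[source] for var, source in profile.env_template.items()}


# Two harnesses with different env var conventions, one code path.
openai_style = HarnessProfile(
    name="openai-style",
    env_template={"OPENAI_BASE_URL": "base_url", "OPENAI_API_KEY": "api_key"},
)
anthropic_style = HarnessProfile(
    name="anthropic-style",
    env_template={"ANTHROPIC_BASE_URL": "base_url", "ANTHROPIC_API_KEY": "api_key"},
)
```

Adding a third harness then means adding a profile entry, not a new branch in core code.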
Use the terms below consistently across code, docs, and schema:
- `provider`: upstream inference provider definition
- `harness_profile`: how a harness is configured to talk to the proxy
- `variant`: a benchmarkable combination of provider route, model, harness profile, and harness settings
- `experiment`: a named comparison grouping that contains one or more variants
- `task_card`: the benchmark task definition used for comparable sessions
- `session`: one interactive benchmark session under one variant and one task card
- `request`: one normalized LLM call observed through LiteLLM
- `metric_rollup`: derived latency, throughput, error, and cache metrics for a request, session, or comparison group
- `artifact`: exported report or raw benchmark bundle
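The vocabulary above can be mirrored directly in typed domain models. A minimal sketch; the fields beyond the glossary terms are assumptions, not the project's actual schema:

```python
# Illustrative domain models matching the shared vocabulary.
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A benchmarkable combination of provider route, model,
    harness profile, and harness settings."""
    variant_id: str
    provider: str
    model: str
    harness_profile: str
    harness_settings: dict


@dataclass(frozen=True)
class Session:
    """One interactive benchmark session under one variant
    and one task card."""
    session_id: str       # benchmark-owned, minted before any traffic
    experiment_id: str
    variant_id: str
    task_card_id: str
```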
Every session must provide enough information to join all collected records. Preserve these keys whenever available:
- benchmark `session_id`
- benchmark `experiment_id`
- benchmark `variant_id`
- benchmark `task_card_id`
- LiteLLM virtual key ID and key alias
- request tags written by the session manager
- LiteLLM call ID or equivalent request ID
- upstream provider request ID when exposed
- timestamps in UTC
If a new collector does not preserve at least one stable request key and one stable session key, it is incomplete.
- Content capture is disabled by default.
- Store metadata, timings, counts, cache counters, request IDs, and routing fields.
- Any feature that persists prompts or responses must be guarded by an explicit config flag and redaction controls.
- Secrets must never be committed, logged, or copied into artifacts.
- Session credentials must be short-lived and scoped.
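The "content off by default" rule can be enforced at the persistence boundary. The config class and field names below are assumptions for illustration only:

```python
# Sketch: explicit opt-in guard for persisting prompt/response content.
from dataclasses import dataclass, field


@dataclass
class CaptureConfig:
    capture_content: bool = False              # explicit opt-in required
    redact_patterns: list[str] = field(default_factory=list)


def persistable_fields(raw: dict, config: CaptureConfig) -> dict:
    """Always keep metadata; include content only when explicitly enabled."""
    allowed = {"timings", "token_counts", "cache_counters", "request_id", "route"}
    out = {k: v for k, v in raw.items() if k in allowed}
    if config.capture_content:
        # Redaction controls would apply here before persistence.
        out["prompt"] = raw.get("prompt")
        out["response"] = raw.get("response")
    return out
```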
Every implementation change must:
- include typed config validation where applicable
- include unit tests for parsing, normalization, or aggregation logic
- include integration tests for service boundaries when practical
- update docs when behavior or contracts change
- preserve idempotent ingestion and deterministic rollups
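Idempotent ingestion usually means keying writes on a stable request ID so a replayed collection job cannot duplicate rows. A sketch using sqlite3 purely for illustration (the project's actual store is Postgres, and the table shape here is invented):

```python
# Sketch: idempotent ingestion via upsert keyed on the stable request ID.
import sqlite3


def ingest(conn: sqlite3.Connection, records: list[tuple[str, str, int]]) -> None:
    """Upsert (request_id, session_id, total_tokens); re-runs are no-ops."""
    conn.executemany(
        """
        INSERT INTO requests (request_id, session_id, total_tokens)
        VALUES (?, ?, ?)
        ON CONFLICT(request_id) DO UPDATE SET
            session_id = excluded.session_id,
            total_tokens = excluded.total_tokens
        """,
        records,
    )
    conn.commit()
```

Because the write is an upsert on the primary key, running the same collection batch twice leaves the table unchanged, which is what keeps rollups deterministic.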
Recommended module boundaries:
`src/benchmark_core/`
- config models
- domain models
- repositories
- services

`src/cli/`
- operator commands
- config validation
- session lifecycle commands
- export commands

`src/collectors/`
- LiteLLM collection
- Prometheus collection
- normalization jobs
- rollup jobs

`src/reporting/`
- comparison services
- serialization
- dashboard query helpers

`src/api/`
- HTTP endpoints over the canonical query model
A valid session flow is:
- operator chooses variant and task card
- system creates a session row
- system creates a session-scoped proxy credential and alias
- system stores session metadata including repo path, git commit, and selected harness profile
- system renders the harness env snippet
- operator launches the harness manually
- collectors ingest and normalize traffic
- session is finalized with end time and summary rollups
Any shortcut that allows harness traffic before session registration breaks comparability.
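The ordering constraint can be made concrete: the session row and its scoped credential exist before the operator ever launches a harness. A hedged sketch; the function name, fields, and alias scheme are illustrative, not the actual service API:

```python
# Sketch: registration-before-traffic ordering for a benchmark session.
import uuid
from datetime import datetime, timezone


def start_session(variant_id: str, task_card_id: str,
                  repo_path: str, git_commit: str) -> dict:
    """Create the session row and session-scoped credential alias
    before any harness traffic is allowed."""
    session_id = f"bench-{uuid.uuid4()}"
    return {
        "session_id": session_id,
        "variant_id": variant_id,
        "task_card_id": task_card_id,
        "repo_path": repo_path,
        "git_commit": git_commit,
        "proxy_key_alias": f"session-{session_id}",  # session-scoped credential
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
```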
A sub-issue is done when:
- scope is implemented
- acceptance criteria pass
- tests are green
- docs are updated
- the change can be understood by another engineer without hidden local context
The project uses uv for dependency management and provides a Makefile for common tasks.
```shell
# Install all dependencies including dev tools
make install-dev

# Run full quality check (lint + type-check + test)
make quality

# Run individual checks
make lint        # Run ruff linter
make format      # Run ruff formatter
make type-check  # Run mypy type checker
make test        # Run all tests
```

- `make install` - Install production dependencies with uv
- `make install-dev` - Install all dependencies including dev tools
- `make sync` - Sync dependencies from pyproject.toml
- `make lint` - Run ruff linter
- `make format` - Run ruff formatter
- `make format-check` - Check formatting without modifying files
- `make type-check` - Run mypy type checker
- `make test` - Run all tests
- `make test-unit` - Run unit tests only
- `make test-integration` - Run integration tests only
- `make test-cov` - Run tests with coverage report
- `make quality` - Run full quality check (lint + type-check + test)
- `make clean` - Clean build artifacts and cache files
- `make dev-setup` - Complete setup for new development environment
- `make dev-check` - Quick check before committing
├── pyproject.toml # Project configuration and dependencies
├── Makefile # Development task runner
├── src/
│ ├── benchmark_core/ # Core domain (models, config, services, repositories)
│ ├── cli/ # CLI commands using Typer
│ ├── collectors/ # LiteLLM and Prometheus data collection
│ ├── reporting/ # Comparison services and serialization
│ └── api/ # FastAPI HTTP endpoints
└── tests/
├── unit/ # Unit tests for each package
└── integration/ # Integration tests
- Ruff: Configured in `pyproject.toml` for linting and formatting (Python 3.11+ target)
- mypy: Type checking with strict settings (`disallow_untyped_defs`)
- pytest: Test discovery and execution with asyncio support
- pytest-cov: Coverage reporting
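These tools are typically configured in `pyproject.toml`. A hedged fragment consistent with the settings listed above; the exact keys in this repo may differ:

```toml
[tool.ruff]
target-version = "py311"

[tool.mypy]
disallow_untyped_defs = true

[tool.pytest.ini_options]
asyncio_mode = "auto"
```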
Before starting a sub-issue, read:
- README.md
- this file
- the referenced docs listed in the sub-issue body
If the task depends on schema, config, or reporting behavior, also read the relevant contract document in docs/ before coding.