copilot-eval

A/B evaluation framework for GitHub Copilot CLI customizations, using OpenTelemetry traces.

Measure the effect of plugins, custom instructions, MCP servers, and other Copilot customizations with reproducible, containerized eval runs and automated analysis.

Quick Start

git clone https://github.com/openjny/copilot-eval.git
cd copilot-eval

# Prerequisites: Docker, uv, gh auth login
cp .env.example .env    # Configure credentials

Try the prompt-language example

# Run eval (2 tasks × 2 variants × 3 epochs = 12 runs, ~2 min)
uv run copilot-eval run --config-dir examples/prompt-language

# Analyze
uv run copilot-eval analyze --run-id <RUN_ID> --config-dir examples/prompt-language -o markdown
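
The run count quoted above (2 tasks × 2 variants × 3 epochs = 12 runs) is a full cross product of the three dimensions. The following is a minimal sketch of that expansion, not the framework's actual code: the task and variant names and the `expand_runs` helper are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical task and variant names; the real ones come from the
# example's configuration and are not shown in this README.
tasks = ["task-a", "task-b"]
variants = ["baseline", "treatment"]
epochs = 3


def expand_runs(tasks, variants, epochs):
    """Return one (task, variant, epoch) tuple per eval run."""
    return list(product(tasks, variants, range(1, epochs + 1)))


runs = expand_runs(tasks, variants, epochs)
print(len(runs))  # 12
```

Adding a task, a variant, or an epoch multiplies the total, so the expected run count is worth checking before starting a large eval.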

CLI

uv run copilot-eval <command> [options]

Command	Description
`list --config-dir <dir>`	List tasks and variants
`build --config-dir <dir>`	Build Docker images
`run --config-dir <dir> [--task NAME] [--epochs N]`	Execute eval runs
`analyze --run-id <ID> [--config-dir <dir>] [-o table|json|markdown]`	Analyze results

Examples

Example	What it evaluates
prompt-language	English vs Japanese prompts on code tasks
azure-skills	Azure Skills Plugin impact on Azure operations

Documentation

Configuration Guide — eval-config.yaml, evaluators, fixtures, hooks, parallel modes
Architecture — execution flow, Docker design, OTel tracing, report generation

Project Structure

copilot-eval/
├── eval/                  # Framework
│   ├── cli.py             # CLI entry point
│   ├── config.py          # Config loading
│   ├── runner.py          # Docker execution + evaluators
│   ├── trace.py           # Jaeger trace parsing
│   └── report.py          # A/B comparison reports
├── docker/
│   ├── Dockerfile         # Base image (Node 20 + Copilot CLI)
│   └── entrypoint.sh      # Auth merging
├── examples/              # Eval sets
├── docs/                  # Detailed documentation
└── docker-compose.yml     # Jaeger

The framework tags each run with eval.test_id, eval.variant, eval.scenario, and eval.epoch via OTEL_RESOURCE_ATTRIBUTES, enabling A/B comparison in Jaeger.
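
As a rough illustration of the tagging described above, `OTEL_RESOURCE_ATTRIBUTES` takes a comma-separated list of `key=value` pairs. The sketch below builds such a string for the four `eval.*` attributes. The helper name `build_resource_attributes` and the sample values are assumptions made for this example; the framework's own implementation may differ.

```python
from urllib.parse import quote


def build_resource_attributes(test_id, variant, scenario, epoch):
    """Build an OTEL_RESOURCE_ATTRIBUTES value from the four eval.* tags.

    Hypothetical helper, not part of copilot-eval. Values are
    percent-encoded so that commas or equals signs inside a value
    cannot break the key=value,key=value format.
    """
    attrs = {
        "eval.test_id": test_id,
        "eval.variant": variant,
        "eval.scenario": scenario,
        "eval.epoch": str(epoch),
    }
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in attrs.items())


# Sample values are made up for illustration.
value = build_resource_attributes("run-001", "baseline", "hello-world", 1)
print(value)
# eval.test_id=run-001,eval.variant=baseline,eval.scenario=hello-world,eval.epoch=1
```

Because every span from a run carries the same resource attributes, traces can then be filtered by a tag such as `eval.variant` when comparing the two sides of an A/B test.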

Note: COPILOT_HOME must be writable for OTel span correlation to work correctly. The entrypoint handles this by copying auth from a read-only mount to a writable directory.
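
The copy step in the note above can be pictured as follows. This is a Python illustration of the idea only: the actual logic lives in `docker/entrypoint.sh`, and the function name and the directories used here are hypothetical.

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def prepare_copilot_home(readonly_auth_dir, writable_home):
    """Copy auth files into a writable directory and point COPILOT_HOME at it.

    Hypothetical helper; both directory arguments are assumptions.
    Returns an environment mapping for a child process.
    """
    src = Path(readonly_auth_dir)
    dst = Path(writable_home)
    dst.mkdir(parents=True, exist_ok=True)
    shutil.copytree(src, dst, dirs_exist_ok=True)

    # copytree preserves permission bits, so add owner-write explicitly
    # in case the source tree was marked read-only.
    dst.chmod(dst.stat().st_mode | stat.S_IWUSR)
    for root, dirs, files in os.walk(dst):
        for name in dirs + files:
            path = Path(root) / name
            path.chmod(path.stat().st_mode | stat.S_IWUSR)

    env = dict(os.environ)
    env["COPILOT_HOME"] = str(dst)
    return env


# Demo with temporary directories standing in for the real mount points.
with tempfile.TemporaryDirectory() as tmp:
    mounted = Path(tmp) / "auth-ro"
    mounted.mkdir()
    (mounted / "config.json").write_text("{}")
    env = prepare_copilot_home(mounted, Path(tmp) / "copilot-home")
    copied_ok = (Path(env["COPILOT_HOME"]) / "config.json").exists()
    home_writable = os.access(env["COPILOT_HOME"], os.W_OK)
```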

License

MIT
