
feat: Add Harbor Terminal-Bench comparison for agent effectiveness#199

Merged
jeremyeder merged 1 commit into ambient-code:main from jeremyeder:compare
Dec 10, 2025
@jeremyeder jeremyeder commented Dec 10, 2025

Summary

Add a comprehensive Harbor integration that empirically measures the impact of agent files on Claude Code performance, using Harbor's Terminal-Bench.

Key Features:

  • 🔬 A/B testing framework (with/without any agent file)
  • 📊 Statistical significance testing (t-tests, Cohen's d effect sizes)
  • 📄 Multiple output formats (JSON, Markdown, HTML)
  • 📈 Interactive dashboard with Chart.js visualizations (self-contained)
  • 🛠️ CLI commands: compare, list, view

Example Usage:

# Install Harbor
uv tool install harbor

# Compare any agent file (default: .claude/agents/doubleagent.md)
agentready harbor compare \
  -t adaptive-rejection-sampler \
  -t async-http-client \
  -t terminal-file-browser \
  --verbose \
  --open-dashboard

# Or specify a different agent file
agentready harbor compare \
  -t task1 -t task2 \
  --agent-file .claude/agents/my-custom-agent.md \
  --verbose

Architecture

Data Models (src/agentready/models/harbor.py):

  • HarborTaskResult - Single task result from result.json
  • HarborRunMetrics - Aggregated metrics per run
  • HarborComparison - Complete comparison with deltas and statistical tests
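The relationship between the three models can be sketched roughly as follows (field names here are illustrative assumptions; the actual definitions live in src/agentready/models/harbor.py and may differ):

```python
from dataclasses import dataclass, field


@dataclass
class HarborTaskResult:
    """One task's outcome, as parsed from a Harbor result.json (fields assumed)."""
    task_name: str
    success: bool
    duration_seconds: float


@dataclass
class HarborRunMetrics:
    """Aggregates a run's task results into per-run metrics."""
    results: list[HarborTaskResult] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.success for r in self.results) / len(self.results)


@dataclass
class HarborComparison:
    """Pairs a baseline run (no agent file) with a treatment run (agent file on)."""
    baseline: HarborRunMetrics
    with_agent: HarborRunMetrics

    @property
    def success_rate_delta(self) -> float:
        return self.with_agent.success_rate - self.baseline.success_rate
```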

Services (src/agentready/services/harbor/):

  • HarborRunner - Execute Harbor CLI via subprocess
  • AgentFileToggler - Safely enable/disable agent files using context managers
  • ResultParser - Parse Harbor result.json files
  • HarborComparer - Calculate deltas and statistical significance
  • DashboardGenerator - Generate HTML reports with inlined Chart.js

Reporters (src/agentready/reporters/):

  • HarborMarkdownReporter - GitHub-Flavored Markdown reports
  • DashboardGenerator - Interactive HTML dashboard

Code Quality

Agent Reviews:

  • feature-dev:code-reviewer - No high-priority issues, excellent architecture
  • pr-review-toolkit:code-simplifier - Clean separation of concerns
  • doubleagent.md - 95% score, exemplary library-first architecture

Critical Fixes Applied:

  1. Fixed division by zero in delta calculations (returns None instead of 0.0)
  2. Removed manual toggler.enable() call (context manager handles restoration)
  3. Inlined Chart.js library (maintains self-contained HTML report principle)
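Fix 1 can be illustrated with a small helper (hypothetical name percent_delta; the real code lives in HarborComparer): returning None for a zero baseline distinguishes "undefined" from "no change", so reports can render N/A instead of a misleading 0.0%.

```python
from typing import Optional


def percent_delta(baseline: float, comparison: float) -> Optional[float]:
    """Percent change from baseline to comparison.

    Returns None when the baseline is zero, since the delta is undefined
    rather than zero in that case.
    """
    if baseline == 0:
        return None
    return (comparison - baseline) / baseline * 100.0
```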

Test Coverage:

  • 24/24 tests passing ✅
  • 98% models coverage
  • 95%+ services coverage
  • Comprehensive edge case testing (exceptions, idempotency, invalid data)

Linters:

  • black - Code formatting
  • isort - Import sorting
  • ruff - Linting (all checks passed)
  • ✅ Pre-commit hooks - All passing

Output Files

Results stored in .agentready/harbor_comparisons/ (gitignored):

  • JSON: Machine-readable comparison data
  • Markdown: GitHub-friendly report (commit this for PRs)
  • HTML: Interactive dashboard with Chart.js visualizations (self-contained, ~210KB)

Symlinks (for quick access):

  • comparison_latest.json
  • comparison_latest.md
  • comparison_latest.html

Statistical Methods

Significance Criteria (both required):

  • P-value < 0.05: significant at the 95% confidence level (two-sample t-test)
  • Cohen's d effect size:
    • Small: 0.2 ≤ |d| < 0.5
    • Medium: 0.5 ≤ |d| < 0.8
    • Large: |d| ≥ 0.8
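The two criteria above can be computed with scipy's t-test plus a pooled-standard-deviation Cohen's d (a sketch of the approach, not the exact HarborComparer code):

```python
import math

from scipy import stats


def significance(baseline: list[float], treatment: list[float], alpha: float = 0.05) -> dict:
    """Two-sample t-test plus Cohen's d; both criteria must pass for significance."""
    t_stat, p_value = stats.ttest_ind(treatment, baseline)
    # Pooled SD from the two sample standard deviations (tstd uses ddof=1).
    pooled_sd = math.sqrt((stats.tstd(baseline) ** 2 + stats.tstd(treatment) ** 2) / 2)
    mean_diff = sum(treatment) / len(treatment) - sum(baseline) / len(baseline)
    d = mean_diff / pooled_sd
    return {
        "p_value": p_value,
        "cohens_d": d,
        "significant": p_value < alpha and abs(d) >= 0.2,
    }
```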

Sample Size Recommendations:

  • Minimum: 3 tasks (for statistical tests)
  • Recommended: 5-10 tasks (reliable results)
  • Comprehensive: 20+ tasks (production validation)

Use Cases

1. Validate Agent File Effectiveness

agentready harbor compare \
  --agent-file .claude/agents/my-agent.md \
  -t task1 -t task2 -t task3

2. Compare Different Agents
Run twice with different --agent-file values, then compare the JSON results.

3. A/B Test Agent Modifications
Before/after comparison when iterating on agent designs.

4. Benchmark Agent Performance
Quantify agent impact on success rate, duration, and task completion.

Documentation

  • ✅ User guide: docs/harbor-comparison-guide.md (comprehensive, 400 lines)
  • ✅ Developer guide: CLAUDE.md (Harbor section added, 100 lines)
  • ✅ Code comments: Extensive docstrings and inline comments

Test Plan

  • Run Harbor comparison with 3 tasks (verified all metrics calculate correctly)
  • Test with/without agent file (toggler context manager works)
  • Verify statistical calculations (t-tests, Cohen's d, significance flags)
  • Generate all output formats (JSON, Markdown, HTML)
  • Test symlink creation (cross-platform compatibility)
  • Verify error handling (missing Harbor, invalid JSON, failed benchmarks)
  • Test edge cases (zero duration, all failures, partial data)
  • Run unit tests (24/24 passing)
  • Run linters (black, isort, ruff - all passing)

Related Issues

Provides empirical validation for any agent file's impact on Claude Code performance.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
