
feat: Add Harbor Terminal-Bench comparison for agent effectiveness#199

Merged
jeremyeder merged 1 commit into ambient-code:main from jeremyeder:compare
Dec 10, 2025
@jeremyeder jeremyeder commented Dec 10, 2025

Summary

Add a comprehensive Harbor integration that empirically measures the impact of agent files on Claude Code performance, using Harbor's Terminal-Bench.

Key Features:

  • 🔬 A/B testing framework (with/without any agent file)
  • 📊 Statistical significance testing (t-tests, Cohen's d effect sizes)
  • 📄 Multiple output formats (JSON, Markdown, HTML)
  • 📈 Interactive dashboard with Chart.js visualizations (self-contained)
  • 🛠️ CLI commands: compare, list, view

Example Usage:

# Install Harbor
uv tool install harbor

# Compare any agent file (default: .claude/agents/doubleagent.md)
agentready harbor compare \
  -t adaptive-rejection-sampler \
  -t async-http-client \
  -t terminal-file-browser \
  --verbose \
  --open-dashboard

# Or specify a different agent file
agentready harbor compare \
  -t task1 -t task2 \
  --agent-file .claude/agents/my-custom-agent.md \
  --verbose

Architecture

Data Models (src/agentready/models/harbor.py):

  • HarborTaskResult - Single task result from result.json
  • HarborRunMetrics - Aggregated metrics per run
  • HarborComparison - Complete comparison with deltas and statistical tests
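The relationship between the three models can be sketched roughly as follows (field names here are illustrative assumptions; the actual definitions live in src/agentready/models/harbor.py and may differ):

```python
from dataclasses import dataclass, field


@dataclass
class HarborTaskResult:
    """One task's outcome, as parsed from a Harbor result.json (fields assumed)."""
    task_name: str
    success: bool
    duration_seconds: float


@dataclass
class HarborRunMetrics:
    """Aggregates a run's task results into per-run metrics."""
    results: list[HarborTaskResult] = field(default_factory=list)

    @property
    def success_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(r.success for r in self.results) / len(self.results)


@dataclass
class HarborComparison:
    """Pairs a baseline run (no agent file) with a treatment run (agent file on)."""
    baseline: HarborRunMetrics
    with_agent: HarborRunMetrics

    @property
    def success_rate_delta(self) -> float:
        return self.with_agent.success_rate - self.baseline.success_rate
```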

Services (src/agentready/services/harbor/):

  • HarborRunner - Execute Harbor CLI via subprocess
  • AgentFileToggler - Safely enable/disable agent files using context managers
  • ResultParser - Parse Harbor result.json files
  • HarborComparer - Calculate deltas and statistical significance
  • DashboardGenerator - Generate HTML reports with inlined Chart.js

Reporters (src/agentready/reporters/):

  • HarborMarkdownReporter - GitHub-Flavored Markdown reports
  • DashboardGenerator - Interactive HTML dashboard

Code Quality

Agent Reviews:

  • feature-dev:code-reviewer - No high-priority issues, excellent architecture
  • pr-review-toolkit:code-simplifier - Clean separation of concerns
  • doubleagent.md - 95% score, exemplary library-first architecture

Critical Fixes Applied:

  1. Fixed division by zero in delta calculations (returns None instead of 0.0)
  2. Removed manual toggler.enable() call (context manager handles restoration)
  3. Inlined Chart.js library (maintains self-contained HTML report principle)
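Fix 1 can be illustrated with a small helper (hypothetical name percent_delta; the real code lives in HarborComparer): returning None for a zero baseline distinguishes "undefined" from "no change", so reports can render N/A instead of a misleading 0.0%.

```python
from typing import Optional


def percent_delta(baseline: float, comparison: float) -> Optional[float]:
    """Percent change from baseline to comparison.

    Returns None when the baseline is zero, since the delta is undefined
    rather than zero in that case.
    """
    if baseline == 0:
        return None
    return (comparison - baseline) / baseline * 100.0
```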

Test Coverage:

  • 24/24 tests passing ✅
  • 98% models coverage
  • 95%+ services coverage
  • Comprehensive edge case testing (exceptions, idempotency, invalid data)

Linters:

  • black - Code formatting
  • isort - Import sorting
  • ruff - Linting (all checks passed)
  • ✅ Pre-commit hooks - All passing

Output Files

Results stored in .agentready/harbor_comparisons/ (gitignored):

  • JSON: Machine-readable comparison data
  • Markdown: GitHub-friendly report (commit this for PRs)
  • HTML: Interactive dashboard with Chart.js visualizations (self-contained, ~210KB)

Symlinks (for quick access):

  • comparison_latest.json
  • comparison_latest.md
  • comparison_latest.html

Statistical Methods

Significance Criteria (both required):

  • P-value < 0.05: significant at the 95% confidence level (two-sample t-test)
  • Cohen's d effect size:
    • Small: 0.2 ≤ |d| < 0.5
    • Medium: 0.5 ≤ |d| < 0.8
    • Large: |d| ≥ 0.8
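The two criteria above can be computed with scipy's t-test plus a pooled-standard-deviation Cohen's d (a sketch of the approach, not the exact HarborComparer code):

```python
import math

from scipy import stats


def significance(baseline: list[float], treatment: list[float], alpha: float = 0.05) -> dict:
    """Two-sample t-test plus Cohen's d; both criteria must pass for significance."""
    t_stat, p_value = stats.ttest_ind(treatment, baseline)
    # Pooled SD from the two sample standard deviations (tstd uses ddof=1).
    pooled_sd = math.sqrt((stats.tstd(baseline) ** 2 + stats.tstd(treatment) ** 2) / 2)
    mean_diff = sum(treatment) / len(treatment) - sum(baseline) / len(baseline)
    d = mean_diff / pooled_sd
    return {
        "p_value": p_value,
        "cohens_d": d,
        "significant": p_value < alpha and abs(d) >= 0.2,
    }
```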

Sample Size Recommendations:

  • Minimum: 3 tasks (for statistical tests)
  • Recommended: 5-10 tasks (reliable results)
  • Comprehensive: 20+ tasks (production validation)

Use Cases

1. Validate Agent File Effectiveness

agentready harbor compare \
  --agent-file .claude/agents/my-agent.md \
  -t task1 -t task2 -t task3

2. Compare Different Agents
Run twice with different --agent-file values, then compare the JSON results.

3. A/B Test Agent Modifications
Before/after comparison when iterating on agent designs.

4. Benchmark Agent Performance
Quantify agent impact on success rate, duration, and task completion.

Documentation

  • ✅ User guide: docs/harbor-comparison-guide.md (comprehensive, 400 lines)
  • ✅ Developer guide: CLAUDE.md (Harbor section added, 100 lines)
  • ✅ Code comments: Extensive docstrings and inline comments

Test Plan

  • Run Harbor comparison with 3 tasks (verified all metrics calculate correctly)
  • Test with/without agent file (toggler context manager works)
  • Verify statistical calculations (t-tests, Cohen's d, significance flags)
  • Generate all output formats (JSON, Markdown, HTML)
  • Test symlink creation (cross-platform compatibility)
  • Verify error handling (missing Harbor, invalid JSON, failed benchmarks)
  • Test edge cases (zero duration, all failures, partial data)
  • Run unit tests (24/24 passing)
  • Run linters (black, isort, ruff - all passing)

Related Issues

Provides empirical validation for any agent file's impact on Claude Code performance.

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
