Primer is the harness intelligence layer for agentic engineering. Research and industry evidence converge on a single insight: outcome quality is determined more by the agent harness — tool design, context management, caching, orchestration, and permission boundaries — than by model capability alone. Primer captures session telemetry across agents, decomposes it into harness dimensions, and measures which configurations actually improve outcomes. It turns that data into harness attribution, coaching, enablement, and operational decisions.
This roadmap is organized in two layers:
- Strategy and priorities at the top.
- Detailed shipped and planned capabilities underneath.
Items marked with [x] are shipped. Items marked with [ ] are planned. Planned items include rough priority tags:
- P0: foundational and near-term
- P1: important follow-on work
- P2: valuable expansion work
- Harness effectiveness: Which harness configurations (tool designs, caching strategies, context management, orchestration patterns, permission boundaries) correlate with better outcomes?
- Harness attribution: When a session succeeds or fails, which harness components contributed? What's the per-step compound reliability?
- Harness evolution: How have harness configurations changed over time, and did those changes improve outcomes?
- Harnessability: Are codebases and teams structurally ready for effective agent harnesses (documentation quality, typing, module boundaries, data governance)?
- Dead weight: Which harness configurations are outdated compensations for older model limitations that now bottleneck performance?
- Individual effectiveness: What should each engineer change about their harness setup to improve outcomes?
- Measure harness effectiveness, not just usage — decompose outcomes to the harness component level.
- Build the "code coverage for harnesses" that the industry is asking for (per-component reliability, compound failure math).
- Track longitudinal harness evolution so teams can see how configuration changes correlate with outcome changes over time.
- Make harnessability scoring a first-class product surface (documentation quality, context freshness, guide/sensor coverage).
- Close the loop from harness insight to intervention — including subtractive coaching ("what can you stop doing?").
- Bring harness intelligence into the engineer workflow via MCP sidecar, not only after the fact.
Primer's moat is decomposing outcomes to the harness component level: per-tool success rates, compound reliability math (10 steps at 99% each yields only 90.4% end-to-end), and harness configuration fingerprinting from session telemetry. This is the "code coverage for harnesses" that the industry is asking for.
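The compound reliability figure is straight multiplication of per-step success probabilities. A minimal sketch of the computation (the helper name and step data are illustrative, not a Primer API):

```python
def compound_reliability(step_success_rates: list[float]) -> float:
    """End-to-end success probability when every step must succeed."""
    result = 1.0
    for rate in step_success_rates:
        result *= rate
    return result

# Ten tool calls at 99% per-step reliability compound to ~90.4% end-to-end.
print(round(compound_reliability([0.99] * 10), 3))  # 0.904
```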
Trustworthy semantics remain foundational. Clean taxonomy for outcomes, goals, friction, and success, plus reprocessing and coverage tooling so every downstream metric is credible.
Longitudinal correlation of harness configuration changes with outcome changes over time. LangChain rewrote their harness 4x in one year; Vercel removed 80% of tools and improved. No tool tracks this — Primer's team-level time-series data is uniquely positioned.
Measure whether codebases and teams have the structural properties (documentation quality, context freshness, module boundaries, guide/sensor coverage) that make agent harnesses effective. Extends existing project readiness into a full harnessability assessment.
Recommendations should become assignable, measurable interventions — including subtractive coaching ("what can you stop doing?"). Dead weight detection identifies harness configurations that are outdated compensations for older model limitations.
The most valuable insights should show up during the session via MCP sidecar: harness health scores, context quality warnings, dead weight alerts, and configuration recommendations.
Derived data pipelines, performance optimization, durable background jobs, enterprise identity, and observability.
- [P0] Per-tool success rate tracking with compound reliability computation: decompose session outcomes to the tool/step level.
- [P0] Harness configuration fingerprinting: extract and catalog the actual harness configuration (tools, context files, permissions, customizations) from session telemetry (see the sketch after this list).
- [P0] Context quality scoring: measure AGENTS.md freshness, token efficiency, and guide/sensor coverage per project.
- [P1] Harness evolution timeline: before/after correlation of configuration changes with outcome changes.
- [P1] Harnessability scoring per project: documentation quality, typing strength, module boundaries, data governance readiness.
- [P1] Paragon's 4-dimension evaluation: tool correctness, tool usage accuracy, task completion, task efficiency.
- [P1] Semantic search over sessions via pgvector: exemplar discovery and cross-engineer pattern matching.
- [P2] Automated harness optimization suggestions: evolve coaching to recommend specific harness configuration changes.
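As a rough illustration of harness configuration fingerprinting, here is a minimal sketch of what a per-session fingerprint record could capture and how identical configurations could be grouped; the field names and hashing scheme are assumptions for illustration, not Primer's schema:

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field

@dataclass
class HarnessFingerprint:
    """Hypothetical snapshot of the harness configuration behind one session."""
    agent_type: str                                           # e.g. "claude-code", "cursor"
    tools_enabled: list[str] = field(default_factory=list)
    mcp_servers: list[str] = field(default_factory=list)
    context_files: list[str] = field(default_factory=list)    # AGENTS.md, CLAUDE.md, ...
    permission_mode: str = "default"
    customizations: list[str] = field(default_factory=list)   # skills, commands, templates

    def fingerprint(self) -> str:
        """Stable hash so identical configurations can be grouped and compared."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```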
- Facet taxonomy alignment across extraction, schemas, analytics, and UI
- Outcome normalization and historical backfill for previously ingested sessions
- Coverage dashboard for facet extraction, transcript completeness, GitHub sync, and repository metadata
- Confidence scoring for extracted facets and downstream recommendations
- Cross-agent schema parity matrix so Primer knows which session fields are required, optional, or unavailable per source
- Partial-telemetry handling for IDE-native agents like Cursor so missing transcript, tool, or model fields do not distort org-wide metrics
- [P1] Execution evidence capture: lint, test, build, and verification signals per session
- [P1] Change-shape capture: files touched, diff size, churn, and rewrite/revert indicators
- [P1] Recovery-path tracking: detect whether engineers recover after friction or abandon the attempt
- [P1] Derived analytics tables and materialized rollups for heavy longitudinal queries
- Source-quality dashboard by agent type, including capture coverage and telemetry completeness for Cursor
- [P2] Data-quality anomaly detection for broken ingestion, sparse transcripts, or stale integrations
- Session search with full-text, outcome, type, model, branch filters
- Transcript viewer with message-level detail
- Session health scoring (outcome + friction + duration + satisfaction composite)
- LLM-powered facet extraction (goals, friction types, satisfaction signals)
- End reason breakdown with success rate per reason
- Goal analytics (session type and goal category breakdown)
- Permission mode analysis (success rate by permission level)
- Satisfaction trend tracking (satisfied / neutral / dissatisfied over time)
- Similar sessions panel with 3-tier relevance matching
- [P0] Cursor session ingestion and discovery pipeline
- [P0] Cursor transcript and tool-call extraction mapped onto the normalized session model
- [P1] Cursor native telemetry enrichment for approvals, change shape, and context-usage signals
- [P1] Cursor reliable token and model-usage extraction once source telemetry is trustworthy
- [P1] Workflow fingerprinting: infer common sequences like search -> read -> edit -> test -> fix
- [P1] Cursor-specific workflow fingerprinting and session archetype mapping
- [P1] Session archetype detection: debugging, feature delivery, refactor, migration, docs, investigation
- [P1] Delegation graph capture for multi-agent and subagent workflows
- [P2] Exemplar session library for high-value workflows and onboarding examples
- [P2] Skill, command, and template reuse analytics by workflow and outcome
- [P2] Prompt reuse analytics by workflow and outcome
- Friction type classification (permission denied, timeout, context limit, edit conflict, tool error, exec error)
- Friction impact scoring (occurrence count x success rate penalty; see the sketch after this list)
- Friction trend chart (count + rate over time)
- Project-level friction breakdown
- Friction cluster analysis with sample details
- Anomaly detection for friction spikes
- [P0] Root-cause clustering from transcripts, tool traces, and repeated failure motifs
- [P1] Time-lost estimation per friction type, engineer, and project
- [P1] Toolchain reliability analytics for MCP servers, built-in tools, and external services
- [P1] Friction recovery analysis: what engineers tried after failure and which recoveries worked
- [P2] Real-time friction detection for in-session intervention
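As referenced in the impact scoring item above, a minimal sketch of friction impact scoring under an assumed definition of the success-rate penalty (the gap between baseline success and success in sessions that hit the friction); the weighting is illustrative, not Primer's exact formula:

```python
def friction_impact(
    occurrences: int,
    success_rate_with_friction: float,
    baseline_success_rate: float,
) -> float:
    """Assumed scoring: sessions affected, weighted by how much success drops.

    The penalty term is the gap between the baseline success rate and the
    success rate observed in sessions that hit this friction type.
    """
    penalty = max(0.0, baseline_success_rate - success_rate_with_friction)
    return occurrences * penalty

# Example: 40 "context limit" hits, 55% success vs an 80% baseline -> impact 10.0
print(friction_impact(40, 0.55, 0.80))
```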
- Engineer leaderboard with multi-dimensional ranking
- Personal trajectory dashboard with weekly sparklines
- Strengths and friction breakdown per engineer
- Peer benchmarking (percentile ranking, vs-team-average deltas)
- AI-generated narrative insights per engineer
- Personalized tips based on friction patterns and tool gaps
- Config optimization suggestions from team benchmark comparison
- Skill inventory with proficiency levels per tool
- Learning paths generated from high-performer patterns
- [P0] Effectiveness score: success rate, cost efficiency, quality outcomes, and follow-through
- [P0] Workflow playbooks derived from high-performing peer patterns
- [P1] Plugin and tool recommendation engine based on task type, project context, and similar successful sessions
- [P1] Model selection coach for cost-appropriate model choice by task
- [P1] Personal impact review that combines trajectory, quality, cost, and workflow maturity
- [P2] Longitudinal growth view across quarters, role changes, and team moves
- Cohort comparison (new hire / ramping / experienced)
- Time-to-team-average tracking for new hires
- Onboarding velocity scoring
- Onboarding recommendations
- Shared behavior pattern discovery with approach comparison
- [P1] Bright spot detection: explicitly surface high performers and cross-pollinate their patterns
- [P1] Exemplar-session-to-learning-path pipeline
- [P1] Team skill gap mapping by workflow, tool category, and project context
- [P2] Coaching program measurement: which onboarding or training changes improved outcomes
- Dedicated project workspace with readiness, friction, quality, cost, and enablement views
- Project AI-readiness scoring (CLAUDE.md, AGENTS.md, .claude/ detection)
- Project scorecard that combines adoption, effectiveness, quality, and cost efficiency
- [P0] Project-level workflow fingerprints and friction hotspots
- [P1] Project-level agent mix comparison, including Cursor sessions alongside CLI agents
- [P1] Repository context model: language mix, test maturity, repo size, and AI-enablement signals
- [P1] Project enablement recommendations tied to observed bottlenecks
- [P1] Cross-project comparison: which repos are easiest or hardest to use AI effectively in
- [P2] Project playbook templates for greenfield, legacy, high-compliance, and test-poor repos
- Tool leverage scoring (0-100 composite per engineer)
- Tool category classification (core, search, orchestration, skill, MCP)
- Orchestration adoption rate tracking
- Agent and skill usage analytics (invocation patterns, delegation depth)
- Tool adoption rates and trend charts
- Engineer tool proficiency table
- Daily leverage trend tracking
- [P0] 5-factor harness maturity score: tool design, orchestration, caching, context hygiene, boundary design
- [P0] Dead weight detection: flag zero-invocation and no-outcome-lift customizations (see the sketch after this list)
- [P0] Subtractive coaching: "what you can stop doing" section in coaching briefs
- [P0] `GET /api/v1/harness/deadweight` endpoint with auth-scoped access
- [P1] Model diversity factor in leverage scoring
- [P1] Agent team detection for coordinated multi-agent orchestration
- [P1] Session customization snapshot: capture enabled MCP servers, subagents, skills, commands, and templates alongside what was actually invoked
- [P1] Tool source classification: built-in vs marketplace vs custom
- [P1] Skill provenance + baseline filtering so recommendations and reuse analytics suppress built-in/default skills and focus on explicit user or repo-configured choices
- [P1] Cross-agent customization normalization so Claude, Cursor, Codex, and Gemini plugin surfaces map into one shared model
- [P1] Customization state model: available vs enabled vs invoked for MCPs, subagents, skills, commands, and templates
- [P1] Outcome attribution for customizations: which MCPs, skills, commands, and subagents improve workflow, quality, cost, and friction outcomes
- [P1] Cross-team tooling landscape: overlap, reuse, and local best-of-breed tools
- [P1] High-performer agent stack analysis: which combinations of MCPs, skills, commands, and subagents differentiate top performers
- [P0] Per-tool success rate tracking with compound reliability computation (10 steps at 99% = 90.4% end-to-end)
- [P0] Harness configuration fingerprinting from session telemetry (tools, context files, permissions, customizations)
- [P1] Context quality scoring: AGENTS.md freshness, token efficiency, guide/sensor coverage
- [P1] Harness evolution timeline: before/after correlation of configuration changes with outcome changes
- [P1] Harnessability scoring per project: documentation quality, typing strength, module boundaries
- [P1] Paragon's 4-dimension evaluation: tool correctness, tool usage accuracy, task completion, task efficiency
- [P2] Prompt, skill, and template maturity scoring
- [P2] Automated harness optimization suggestions
- [P2] Dead weight dashboard tab with per-customization detail and removal actions
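As referenced in the dead weight detection item above, a minimal sketch of how flagging could work over assumed per-customization aggregates (invocation counts and outcome lift); the thresholds and field names are illustrative, not Primer's implementation:

```python
from dataclasses import dataclass

@dataclass
class CustomizationStats:
    """Assumed per-customization aggregate over a reporting window."""
    name: str
    invocations: int
    outcome_lift: float   # success-rate delta vs sessions without this customization

def find_dead_weight(stats: list[CustomizationStats],
                     min_lift: float = 0.0) -> list[str]:
    """Flag customizations that are never invoked or show no outcome lift."""
    return [
        s.name for s in stats
        if s.invocations == 0 or s.outcome_lift <= min_lift
    ]

# Example: an unused MCP server and a no-lift skill get flagged; the rest stay.
stats = [
    CustomizationStats("mcp:legacy-search", invocations=0, outcome_lift=0.0),
    CustomizationStats("skill:boilerplate-fixer", invocations=34, outcome_lift=-0.02),
    CustomizationStats("subagent:test-runner", invocations=120, outcome_lift=0.07),
]
print(find_dead_weight(stats))  # ['mcp:legacy-search', 'skill:boilerplate-fixer']
```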
- GitHub OAuth SSO
- Pull request sync via GitHub App
- Commit correlation with sessions
- Claude-assisted vs non-Claude PR comparison (merge rate, review comments, time to merge)
- Quality by session type (debugging, feature, refactoring)
- Code volume tracking (daily lines added/deleted)
- Engineer quality ranking table
- Repository AI-readiness scoring
- Automated review findings tracker (BugBot parser, severity breakdown, fix rate)
- Review findings overview in quality dashboard and engineer profile
- `GET /api/v1/analytics/review-findings` endpoint with source/severity/status filters
- Quality attribution layer linking session behavior to PR outcomes and review findings
- [P1] Additional review bot parsers: CodeRabbit, SonarQube, and other automated review tools
- [P1] Post-merge outcome tracking: reverts, hotfixes, and follow-up bug volume
- [P1] Change-quality analysis by workflow fingerprint and session archetype
- [P2] Review remediation tracking from finding creation to fix completion
- Per-model spend tracking with daily cost chart
- Cost breakdown by model
- Cache efficiency analytics (hit rates, savings, per-engineer potential)
- Billing mode detection (API vs subscription)
- Subscription vs API cost modeling with optimal plan recommendations
- 30-day cost forecasting (linear regression with confidence bands; see the sketch after this list)
- Budget tracking with burn-rate alerts and projected overrun warnings
- Cost per successful outcome metric
- [P1] Break-even analysis for API vs seat-based pricing with per-engineer recommendations
- [P1] Cost per workflow archetype and cost per engineering outcome
- [P1] Workflow compare mode for archetype and fingerprint performance
- [P1] Model-choice opportunity scoring for overspend reduction
- [P2] Budget policy simulation by team, project, and billing model
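As referenced in the forecasting item above, a minimal sketch of a 30-day linear-regression forecast over daily spend, with bands derived from residual spread; the band construction is an assumption, since the roadmap does not pin down the method:

```python
import numpy as np

def forecast_cost(daily_spend: list[float], horizon: int = 30):
    """Fit a linear trend to daily spend and project it forward with +/- bands."""
    y = np.asarray(daily_spend, dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, deg=1)           # least-squares line
    residual_std = np.std(y - (slope * x + intercept))   # spread around the fit
    future_x = np.arange(len(y), len(y) + horizon)
    forecast = slope * future_x + intercept
    return forecast, forecast - residual_std, forecast + residual_std

# Example: fit a week of daily spend and project 30 days ahead.
forecast, low, high = forecast_cost([8.0, 9.5, 9.0, 10.5, 11.0, 10.0, 12.0])
print(round(forecast[-1], 2), round(low[-1], 2), round(high[-1], 2))  # ~27.68 27.09 28.27
```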
- AI-generated narrative reports (engineer, team, org scope)
- Narrative caching with TTL-based expiry
- Auto-refresh via lifespan task
- Conversational data explorer (SSE-streamed tool-use chat)
- AI-powered recommendations panel
- [P1] Saved explorer prompts and reusable report cards
- [P1] Compare mode for engineer, team, project, and time-period analysis
- [P2] Weekly manager review packs that combine quality, friction, growth, and cost
- [P2] Recommendation narratives that explain why a workflow is likely to help
- [P1] Reposition the website around harness intelligence for agentic engineering
- [P1] Showcase harness effectiveness, cost attribution, quality, and exemplar sessions as the core proof points
- [P0] Recommendation-to-intervention workflow with owner, status, due date, and linked evidence
- [P0] Before-and-after measurement for coaching, tooling, or repo changes
- [P1] Experimentation layer for training rollouts, tool changes, and enablement playbooks
- [P1] Intervention effectiveness reporting by team, project, and engineer cohort
- [P2] Auto-generated next-step plans from alerts, narratives, and project findings
- MCP sidecar with on-demand stats, friction reports, and recommendations
- [P0] Proactive coaching skill that activates at session start with contextual suggestions
- [P0] Live session signals that stream friction, satisfaction, and risk as work happens
- [P1] In-session workflow nudges based on project playbooks and prior failures
- [P1] Daily and weekly personal recaps inside the sidecar
- [P2] Lightweight session planning prompts before complex work begins
- Hub-and-spoke dashboard with KPI strip, activity section, attention alerts, deep-dive cards
- Custom date range picker (7d / 30d / 90d / 1y presets + custom)
- Team management with member stats
- Role-based access control (engineer, team lead, admin)
- Admin panel (engineer/team management, audit log, system stats)
- Alert system with configurable thresholds, acknowledge/dismiss workflow
- Slack notification integration
- CSV and PDF export
- API rate limiting
- Dark mode with system preference detection
- [P1] Activation and setup hub for GitHub, budgets, alerts, narrative readiness, and data freshness
- [P1] Performance measurement views for leadership across productivity, quality, cost, and adoption
- [P1] Threshold resolution and policy management that matches actual alerting behavior
- [P1] Device-scoped ingest tokens for hooks and sidecar, backed by authenticated engineer identity instead of long-lived engineer API keys
- [P1] One-time setup codes that exchange browser-authenticated engineers into local device tokens
- [P2] Multi-tenant workspace isolation for multiple organizations on a shared Primer instance
- [P2] Enterprise IdP support with SAML and OIDC for provisioning and SSO
- Multi-agent support (Claude Code, Codex CLI, Gemini CLI, Cursor)
- SessionEnd hook system with agent-specific installers
- `primer sync --watch` for agents without hook systems
- Docker Compose and Kubernetes Helm deployment
- PostgreSQL and SQLite support
- Alembic migration bundling in pip package
- [P0] Cursor `agent_type` support across capture, sync, ingest, and analytics filters
- [P0] Durable background job system for sync, facet extraction, narratives, and alerts
- [P0] Scalable API key lookup and verification strategy
- [P1] Source-capability registry so Primer can safely gate analytics by what each agent source actually provides
- [P1] OpenTelemetry integration for metrics, traces, and logs
- [P1] Redis-backed caching for analytics query results and high-read metadata
- [P1] Analytics performance work for large orgs and concurrent dashboard usage
- [P2] Pluggable warehouse export for long-horizon analysis in external BI tools