Your agent says "done." But do you know how it fails?
Not whether it fails. How.
verify() runs your agent's edits against filesystem reality and tells you what the agent got wrong — the file, the line, the expected value, the actual value. No LLM in the verification path. The answer is not "probably."
Over time, these measurements accumulate into a reliability profile: how this agent fails on this codebase. What it hallucinates. Where it drops edits. Which patterns it repeats.
No other tool builds this model. Linters know your code has problems. Tests know your code produces wrong output. Verify knows why the agent was wrong.
Using verify? We'd love to hear what you're building. Join the discussion
npx @sovereign-labs/verify demo

Three failure modes your current stack misses:
The agent claims it saved a file. It didn't. Verify checks the filesystem.
Without verify:
Agent says: "Report saved successfully."
$ ls reports/weekly.md
ls: cannot access 'reports/weekly.md': No such file or directory
With verify:
Trace 1: Agent claims completion without creating the file.
[FAIL] Filesystem gate: reports/weekly.md does not exist.
Trace 2: Injecting constraints and re-running. Agent creates the file.
[PASS] All gates passed (12 checks)
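The core idea of the filesystem gate needs no model in the loop: trust the disk, not the agent's claim. A minimal sketch of that idea (the function name and return shape are illustrative, not verify's internals):

```javascript
import { existsSync, statSync } from 'node:fs';

// Hypothetical sketch of a filesystem gate: an agent's "saved successfully"
// claim only passes if the file actually exists on disk and is non-empty.
function filesystemGate(claimedPath) {
  if (!existsSync(claimedPath)) {
    return { pass: false, reason: `${claimedPath} does not exist` };
  }
  if (statSync(claimedPath).size === 0) {
    return { pass: false, reason: `${claimedPath} exists but is empty` };
  }
  return { pass: true };
}
```

The check is deterministic and cheap, which is why it can run on every claim.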
The agent writes valid CSS targeting a selector that doesn't exist. Verify knows what's actually in your code.
Without verify:
$ grep '.profile-nav' server.js # CSS rule exists
$ grep -c 'class="profile-nav"' server.js # 0 — element doesn't exist
With verify:
Trace 1: Agent uses selector .profile-nav
[FAIL] Grounding: .profile-nav does not exist in source
Trace 2: Agent uses a.nav-link — exists in reality.
[PASS] All gates passed (12 checks)
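Grounding a class selector against source is, at its simplest, a scan for that class in the markup. A rough sketch of the idea, assuming class selectors and double-quoted `class` attributes only (verify's real grounding covers CSS, HTML, routes, and DB schema):

```javascript
// Hypothetical grounding sketch: a .foo selector is only valid if some
// element in the source actually carries the class "foo".
function selectorIsGrounded(selector, html) {
  if (!selector.startsWith('.')) return true; // sketch handles class selectors only
  const cls = selector.slice(1);
  // Look for the class name inside any class="..." attribute.
  const re = /class\s*=\s*"([^"]*)"/g;
  for (const match of html.matchAll(re)) {
    if (match[1].split(/\s+/).includes(cls)) return true;
  }
  return false;
}
```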
The agent completed the task. But it also quietly changed your config. Verify catches the undeclared mutation.
Without verify:
$ diff config.json.orig config.json
- "darkMode": true
+ "darkMode": false
- "analytics": false
+ "analytics": true
With verify:
Trace 1: Agent edits server.js and config.json.
[FAIL] Containment: 2 undeclared file mutations detected
Trace 2: Agent edits server.js only.
[PASS] All gates passed (11 checks)
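Containment can be sketched as a before/after hash comparison: snapshot every watched file before the agent runs, re-hash afterwards, and flag anything that changed without being declared. The helper names below are illustrative:

```javascript
import { createHash } from 'node:crypto';
import { readFileSync } from 'node:fs';

// Hypothetical containment sketch: detect mutations the agent never declared.
const hashFile = (p) => createHash('sha256').update(readFileSync(p)).digest('hex');

function snapshot(paths) {
  return new Map(paths.map((p) => [p, hashFile(p)]));
}

function undeclaredMutations(before, declaredFiles) {
  const out = [];
  for (const [path, oldHash] of before) {
    if (!declaredFiles.includes(path) && hashFile(path) !== oldHash) {
      out.push(path); // changed on disk, but not in the agent's declared edit set
    }
  }
  return out;
}
```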
Run all three: npx @sovereign-labs/verify demo --scenario=liar|world|drift
import { verify } from '@sovereign-labs/verify';
const result = await verify(edits, predicates, { appDir: './my-app' });
// result.success → true/false
// result.attestation → human-readable summary
// result.narrowing → what to try next (on failure)

26 checks run in sequence. First failure stops the pipeline and tells you exactly what went wrong.
- Can the edit be applied? Does the search string exist in the file?
- Is the edit safe? No XSS, no SQL injection, no leaked secrets, no broken accessibility.
- Did the edit work? CSS selector has the right value. HTTP endpoint returns 200. Database column exists. File was created.
- Did the edit break anything else? Health checks pass. File integrity holds. Config is consistent.
On failure: returns the problem + what to try next. On repeat failure: learns from mistakes — attempt N+1 won't repeat attempt N's error.
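The fail-fast sequencing described above can be sketched in a few lines; the gate objects and result shape here are illustrative, not verify's actual API:

```javascript
// Hypothetical sketch of a fail-fast gate pipeline: run checks in order,
// stop at the first failure, and report which gate failed and why.
function runPipeline(gates, input) {
  for (const gate of gates) {
    const result = gate.check(input);
    if (!result.pass) {
      return { success: false, failedGate: gate.name, reason: result.reason };
    }
  }
  return { success: true, checks: gates.length };
}
```

Stopping at the first failure keeps the error report small and actionable: one problem, one fix, one retry.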
npm install @sovereign-labs/verify
# or
bun add @sovereign-labs/verify

import { verify } from '@sovereign-labs/verify';
const result = await verify(
// Edits: search-and-replace mutations
[
{ file: 'server.js', search: 'color: blue', replace: 'color: red' },
{ file: 'server.js', search: 'Hello', replace: 'Welcome' },
],
// Predicates: what should be true after the edits
[
{ type: 'css', selector: 'h1', property: 'color', expected: 'red' },
{ type: 'content', file: 'server.js', pattern: 'Welcome' },
{ type: 'http', path: '/health', method: 'GET', expect: { status: 200 } },
],
// Config
{ appDir: './my-app' }
);
if (result.success) {
console.log(result.attestation);
} else {
console.log(result.narrowing.resolutionHint);
}

verify() is a single pass. govern() wraps it in a convergence loop — ground reality, plan, verify, narrow, retry. The agent learns from every failure.
import { govern } from '@sovereign-labs/verify';
const result = await govern({
appDir: './my-app',
goal: 'Change the button color to orange',
maxAttempts: 3,
// Your agent — one method: plan
agent: {
plan: async (goal, context) => {
// context.grounding — CSS, HTML, routes, DB schema
// context.narrowing — what failed last time and why
// context.constraints — what's banned and why (K5)
return {
edits: [{ file: 'style.css', search: 'blue', replace: 'orange' }],
predicates: [{ type: 'css', selector: '.btn', property: 'color', expected: 'orange' }],
};
},
},
});
if (result.success) {
console.log(`Converged in ${result.attempts} attempt(s)`);
} else {
console.log(`Stopped: ${result.stopReason}`);
// 'exhausted' | 'stuck' | 'empty_plan_stall' | 'approval_aborted'
}

npx @sovereign-labs/verify init # Create .verify/check.json
npx @sovereign-labs/verify check # Run verification
npx @sovereign-labs/verify demo # See what it catches
npx @sovereign-labs/verify ground # Scan CSS/HTML/routes
npx @sovereign-labs/verify self-test # Run 2,800+ scenario harness
git diff | npx @sovereign-labs/verify check --diff # Pipe git diff

{
"mcpServers": {
"verify": {
"command": "npx",
"args": ["@sovereign-labs/verify", "mcp"]
}
}
}

Tools: verify_ground, verify_read, verify_submit
Multiple agents editing the same codebase? Verify them in sequence — each agent sees the filesystem the previous agent left behind.
import { verifyBatch } from '@sovereign-labs/verify';
const result = await verifyBatch([
{ agent: 'planner', edits: [...], predicates: [...] },
{ agent: 'coder', edits: [...], predicates: [...] },
], { appDir: './my-app', stopOnFailure: true });

If Agent A's changes invalidate Agent B's predicates, the grounding gate catches it. No new infrastructure — the existing gates handle multi-agent conflicts naturally.
The checks are domain-agnostic:
- File system agents — move, rename, organize files
- Infrastructure agents — don't delete the production database
- Communication agents — message the right channel, no forbidden content
- Document agents — don't overwrite the wrong cells
When a PR contains .sql migration files, verify also runs the migration verification pipeline — a separate set of gates that parse the migration with libpg-query, replay the schema from prior migrations on the base branch, and check the new migration against that schema.
The first shipped rule is DM-18 (NOT NULL without safe preconditions): ADD COLUMN x NOT NULL without a DEFAULT, or SET NOT NULL on a nullable column with no default. Both will fail on any non-empty production table — the classic 3am migration failure.
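For intuition only, the DM-18 shape can be approximated at the string level. verify itself parses migrations with libpg-query and replays prior schema; this regex sketch ignores all of that (for example, it flags every SET NOT NULL, while the real rule checks whether the column already has a default):

```javascript
// Rough, illustrative approximation of the DM-18 shape — not verify's parser.
function looksLikeDm18(sql) {
  const addsNotNull = /ADD\s+COLUMN\s+[^;]*\bNOT\s+NULL\b/i.test(sql);
  const hasDefault = /ADD\s+COLUMN\s+[^;]*\bDEFAULT\b/i.test(sql);
  const setNotNull = /SET\s+NOT\s+NULL/i.test(sql); // real rule also checks prior schema
  return (addsNotNull && !hasDefault) || setNotNull;
}
```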
Measured precision: 19 true positives, 0 false positives across 761 production migrations from cal.com, formbricks, and supabase. See MEASURED-CLAIMS.md for full methodology and reproduction steps.
DM-18 is the first vertical of verify's three-vertical product strategy (code-edit verification, database migration verification, HTTP contract verification). Eight other migration shapes (DM-01..05 grounding, DM-15..17, DM-19 safety) are implemented and shipping in CI as warnings while they're calibrated against the corpus. See the Database Migration Failures section of FAILURE-TAXONOMY.md for the full shape catalog.
Findings can be acknowledged in the migration file with -- verify: ack DM-XX <reason> to downgrade them to warnings (audit trail) rather than blocks.
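The ack comment is a plain SQL line comment, so a tool (or a reviewer's script) can extract acknowledgements mechanically. A sketch of parsing that format, with an assumed output shape:

```javascript
// Hypothetical parser for the "-- verify: ack DM-XX <reason>" comment format.
function parseAcks(migrationSql) {
  const acks = [];
  const re = /--\s*verify:\s*ack\s+(DM-\d+)\s+(.*)/g;
  for (const m of migrationSql.matchAll(re)) {
    acks.push({ rule: m[1], reason: m[2].trim() });
  }
  return acks;
}
```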
We scanned every PR in the AIDev-POP dataset — 33,056 real pull requests from 5 AI coding agents across 2,807 popular open-source repos. Deterministic pipeline, $0 cost, no LLM calls.
High-confidence structural finding rates:
| Agent | PRs | Finding Rate | Top Issue |
|---|---|---|---|
| Devin | 4,800 | 8.2% | Unbounded queries |
| Claude Code | 457 | 8.5% | Path/permission |
| Copilot | 4,496 | 4.8% | Path/permission |
| Cursor | 1,539 | 4.4% | Unbounded queries |
| Codex | 21,764 | 1.9% | Unbounded queries |
3.4% of all agent PRs have high-confidence structural issues that existing CI doesn't catch. See METHODOLOGY.md for full details.
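The 3.4% headline is the PR-weighted mean of the per-agent rates in the table above, which can be verified in a few lines:

```javascript
// Recomputing the headline rate from the table (PR-weighted average).
const agents = [
  { prs: 4800, rate: 0.082 },  // Devin
  { prs: 457, rate: 0.085 },   // Claude Code
  { prs: 4496, rate: 0.048 },  // Copilot
  { prs: 1539, rate: 0.044 },  // Cursor
  { prs: 21764, rate: 0.019 }, // Codex
];
const totalPrs = agents.reduce((sum, a) => sum + a.prs, 0);      // 33,056
const flagged = agents.reduce((sum, a) => sum + a.prs * a.rate, 0);
const overall = flagged / totalPrs;                              // ≈ 0.034
```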
- uses: Born14/verify@v0.8.2

Runs verify on every PR. Posts gate results as a comment. Three modes:
- Structural (default, free) — diff-only analysis, no API key needed
- Intent — extracts predicates from PR title/description (Gemini, OpenAI, or Anthropic)
- Staging — Docker build + runtime verification
- FAILURE-TAXONOMY.md — Reference catalog of failure shapes verify's gates can detect, with calibration status per section. Includes the new Database Migration Failures section (DM-01..19).
- MEASURED-CLAIMS.md — DM-18 measured precision (19 TP / 0 FP / 761 migrations) with full methodology and reproduction steps. The first shape in the taxonomy with a published false-positive rate.
- REFERENCE.md — Gates, predicates, configuration, CLI, fault management
- HOW-IT-WORKS.md — Architecture, the 8-stage autonomous loop, migration verification pipeline
- METHODOLOGY.md — AIDev-POP scan methodology and reproducibility (separate from migration corpus methodology, which is in MEASURED-CLAIMS.md)
- PARITY-GRID.md — 8×10 capability × failure class coverage matrix
- ASSESSMENT.md — What verify is and isn't
- ROADMAP.md — Current state and priorities
- GLOSSARY.md — Terms and definitions
MIT