SIN-Code Verification Oracle (Repo 6)

The missing piece of the SIN-Code stack: an independent, execution-based verification layer that answers the one question agents systematically get wrong — "Did it actually work?" — without trusting the agent's self-report.

Where the other five repos are mostly static analysis (the signal agents already have the most of), this repo provides ground truth: compiler output, real test runs, real HTTP responses, and behavioral diffs of actual execution. It also includes an eval harness, so you can measure whether your stack improves results instead of assuming it does.

Why this matters

Failure mode	Caught elsewhere	Caught here
Hallucinated "done", nothing runs	nobody	Execution Oracle
Type error introduced	maybe linter	Diagnostics Oracle
Tests pass but API response silently changed	nobody	Trace-Diff
"Is the stack even helping?"	nobody	Eval Harness

Components

diagnostics.py — Adapts existing language servers/compilers/linters (pyright, ruff, tsc) as oracles. Degrades gracefully when a tool is absent. This is the cheapest, strongest correctness signal, and it reuses battle-tested tools instead of re-implementing weaker AST checks.
execution.py — Ground-truth runner: arbitrary commands, parsed pytest counts, and HTTP probes that boot a server, wait for the port, and assert on real responses.
trace_diff.py — Captures observable behavior (stdout, exit code, artifacts, structured events) and diffs two runs, with noise normalization (timestamps/uuids/addresses) for determinism.
oracle.py — Combines signals into a single Verdict. Refuses to return PASS when it had no ground truth (returns UNVERIFIED instead).
eval_harness.py — SWE-bench-style runner. Tasks have hidden verification commands the agent never sees; resolution is judged purely by the Execution Oracle.
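
The noise normalization mentioned for trace_diff.py can be illustrated with a minimal sketch. The regex patterns, the placeholder strings, and the `normalize` and `diff_traces` names are assumptions made for illustration; the actual module may capture and compare traces differently.

```python
import re

# Hypothetical noise patterns; the real trace_diff.py may use different ones.
_NOISE = [
    # ISO-8601-like timestamps, e.g. 2024-01-02T03:04:05Z
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"), "<TIMESTAMP>"),
    # Canonical 8-4-4-4-12 hex UUIDs
    (re.compile(r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    # Memory addresses such as 0x7ffdeadbeef0
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<ADDR>"),
]


def normalize(text: str) -> str:
    """Replace run-specific noise with stable placeholders."""
    for pattern, placeholder in _NOISE:
        text = pattern.sub(placeholder, text)
    return text


def diff_traces(before: dict, after: dict) -> dict:
    """Return the keys whose normalized values differ between two runs."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        b, a = before.get(key), after.get(key)
        if isinstance(b, str):
            b = normalize(b)
        if isinstance(a, str):
            a = normalize(a)
        if a != b:
            changed[key] = {"before": b, "after": a}
    return changed
```

With this sketch, two runs that differ only in timestamps, UUIDs, or addresses produce an empty diff, while a changed exit code or a changed response body still shows up.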

Install

cd SIN-Code-Verification-Oracle
pip install -e .
# optional, for MCP server:
pip install -e '.[mcp]'
# install the language tools you want as oracles:
pip install pyright ruff
# for tsc (TypeScript): npm i -g typescript
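
To see which of these tools a diagnostics run could pick up on a given machine, a PATH check is usually enough. The sketch below is not part of the package; `ORACLE_TOOLS` and `available_oracles` are hypothetical names, and only the tool names themselves come from this README.

```python
import shutil

# Tool names come from the README; the descriptions are illustrative.
ORACLE_TOOLS = {
    "pyright": "Python type checking",
    "ruff": "Python linting",
    "tsc": "TypeScript compilation",
}


def available_oracles(tools=ORACLE_TOOLS, which=shutil.which):
    """Split tool names into those found on PATH and those missing."""
    found, missing = [], []
    for name in tools:
        (found if which(name) else missing).append(name)
    return found, missing


if __name__ == "__main__":
    found, missing = available_oracles()
    print("available:", found)
    print("not installed (would be skipped):", missing)
```

Treating a missing tool as "skipped" rather than "failed" is one plausible reading of graceful degradation; it keeps an absent linter from being mistaken for a code error.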

Usage

# Independent verdict — exits non-zero on FAIL so CI/agent loops can gate on it
oracle verify --test pytest --build "python -m compileall ."

# Just the diagnostics oracle
oracle diagnostics .

# Behavioral trace diff: capture before the edit, diff after
oracle trace-capture "python app.py --selfcheck" --out before.json
# ... agent edits code ...
oracle trace-diff "python app.py --selfcheck" --before before.json

# Measure your agent against a suite (baseline shown with no-op agent)
oracle eval examples/suite.example.json --label baseline
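
Because `oracle verify` exits non-zero on FAIL, a CI step or agent loop can gate on the exit status alone. The sketch below shows that pattern with stand-in commands so it runs without the CLI installed; `gate` is a hypothetical helper, and how UNVERIFIED maps to an exit code is not stated here, so that should be checked before relying on it.

```python
import subprocess
import sys


def gate(command: list[str]) -> bool:
    """Run a verification command; True only if it exits with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0


# Stand-in commands so the sketch runs anywhere. In a real loop the command
# might be ["oracle", "verify", "--test", "pytest"].
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

if __name__ == "__main__":
    print("passing command gates open:", gate(passing))
    print("failing command gates open:", gate(failing))
```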

MCP integration

# ~/.config/opencode/config.yaml (or Codex/Hermes equivalent)
mcpServers:
  sin-code-oracle:
    command: oracle
    args: [serve]

The key tool is verify_change: agents should call it before reporting a task complete. The returned Verdict includes verified and confidence, so the agent knows how much to trust it.
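
On the agent side, one way to act on a returned Verdict is to refuse a "done" report unless the result is a verified PASS with enough confidence. The sketch assumes the Verdict arrives as a dict with `status`, `verified`, and `confidence` keys; `verified` and `confidence` are named above, while the dict shape, the `status` key, and the 0.8 threshold are assumptions.

```python
def may_report_done(verdict: dict, min_confidence: float = 0.8) -> bool:
    """Allow a 'done' report only on a verified PASS with enough confidence.

    The dict layout and the threshold are illustrative assumptions.
    """
    return (
        verdict.get("status") == "PASS"
        and verdict.get("verified") is True
        and float(verdict.get("confidence", 0.0)) >= min_confidence
    )
```

A missing or malformed Verdict falls through to `False`, which matches the principle that absence of evidence should not count as success.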

Wiring your real agent into the eval harness

EvalHarness.run_suite(tasks, agent) takes any callable agent(workspace_path, task). Replace the no-op in cli.py:eval with a call into your agent (OpenCode/Codex/Hermes), point it at the copied workspace, and the harness reports resolved-rate. Track that number across config changes.
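
The shape of that loop can be sketched independently of the real EvalHarness. Everything below is a simplified stand-in: `run_suite_sketch`, the `verify` task key, and both example agents are hypothetical, and the real harness judges resolution through the Execution Oracle rather than a bare exit code.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def noop_agent(workspace_path: str, task: dict) -> None:
    """Baseline agent: changes nothing."""


def writing_agent(workspace_path: str, task: dict) -> None:
    """Toy agent that creates the file the example task asks for."""
    Path(workspace_path, "answer.txt").write_text("done")


def run_suite_sketch(tasks, agent, source_dir) -> float:
    """Copy the workspace per task, run the agent, judge by exit code."""
    resolved = 0
    for task in tasks:
        with tempfile.TemporaryDirectory() as tmp:
            workspace = Path(tmp) / "ws"
            shutil.copytree(source_dir, workspace)
            # The agent never sees the verification command.
            visible = {k: v for k, v in task.items() if k != "verify"}
            agent(str(workspace), visible)
            check = subprocess.run(task["verify"], cwd=workspace, capture_output=True)
            resolved += check.returncode == 0
    return resolved / len(tasks) if tasks else 0.0


# Example task: resolved only if answer.txt exists after the agent runs.
EXAMPLE_TASK = {
    "prompt": "create answer.txt",
    "verify": [
        sys.executable,
        "-c",
        "import pathlib, sys; sys.exit(0 if pathlib.Path('answer.txt').exists() else 1)",
    ],
}
```

Running the same task list with `noop_agent` and then with a real agent gives the baseline and the treatment number to compare.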

Design principles

The agent's self-report is never an input. Only ground truth counts.
Refusing to confirm is a feature. No signal → UNVERIFIED, not PASS.
Reuse, don't re-implement. Compilers/type-checkers are better oracles than anything we'd hand-roll.
Measure everything. If the eval number doesn't move, the feature didn't help — regardless of how clever it is.
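
The second principle can be made concrete with a small combiner. This is a sketch under assumptions, not the logic in oracle.py: the `Status` enum, the `combine` function, and the convention that `None` means "this oracle could not run" are all introduced here for illustration.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNVERIFIED = "UNVERIFIED"


def combine(signals: list[Optional[bool]]) -> Status:
    """Combine oracle results; None means the oracle produced no signal."""
    ran = [s for s in signals if s is not None]
    if not ran:
        # No ground truth at all: refuse to confirm.
        return Status.UNVERIFIED
    return Status.PASS if all(ran) else Status.FAIL
```

The important branch is the first one: an empty or all-`None` input yields UNVERIFIED, so a run where no tool was available can never be reported as a pass.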
