84 changes: 84 additions & 0 deletions docs/plans/operational-hardening.md
@@ -0,0 +1,84 @@
# Plan: Operational Hardening

Status: proposed
Owner: unassigned
Scope: all 7 SIN-Code repositories (SCKG, IBD, POC, EFSM, ADW, Verification-Oracle, Bundle)

## Motivation

The stack is functionally complete and every repo has a green local `pytest`
run, but it is not yet operationally production-ready. There is no automated
verification on push, no release process, and no guard against the repos
drifting out of sync. This plan covers the concrete, near-term engineering work
needed to make the stack trustworthy to install and contribute to.

This plan deliberately contains **no new features** — only CI, release,
packaging, and consistency work.

## Workstreams

### WS1 — Continuous Integration (per repo)
Add a GitHub Actions workflow (`.github/workflows/ci.yml`) to every repo that:
- runs on `push` and `pull_request`
- sets up a matrix of Python 3.11 / 3.12 / 3.13
- installs the package with `pip install -e .`
- runs `pytest -q`
- where the repo declares optional extras, installs them and re-runs `pytest -q`

Acceptance:
- A green check is required on every PR before merge.
- A failing test blocks the merge.

### WS2 — Lint & format gate
Adopt `ruff` (lint + format) across all repos with a single shared config.
- Add `ruff` to a `[project.optional-dependencies] dev` group.
- Add a `lint` job to CI: `ruff check .` and `ruff format --check .`.
- Fix existing violations in one mechanical commit per repo.

Acceptance:
- `ruff check .` is clean on every repo.

### WS3 — Release & packaging
- Add a `release.yml` workflow triggered on tag `v*` that builds an sdist +
wheel (`python -m build`) and attaches them to a GitHub Release.
- Verify each `pyproject.toml` has correct metadata: `license`, `authors`,
`readme`, `classifiers`, `urls` (Homepage/Repository/Issues).
- Document the versioning policy (SemVer, as the CHANGELOGs already note).

Acceptance:
- Pushing a tag produces a downloadable wheel + sdist per repo.
- `pip install <wheel>` works in a clean environment.

### WS4 — Cross-repo consistency check
The Bundle depends on the 6 subsystems via local path installs. Add a small
script (`scripts/check_consistency.py` in the Bundle) that asserts:
- every subsystem package version matches the Bundle's expectation
- every subsystem exposes the CLI entry point the Bundle's `status` command
probes for
- the MCP tool names advertised by `sin mcp-config` match the tools actually
registered in each `mcp_server.py`

Wire it into the Bundle's CI as a non-blocking (warning) job first, then
promote to blocking once green.

Acceptance:
- `python scripts/check_consistency.py` exits 0 against the current repos.
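
Two of the three assertions (version match and MCP tool-name match) can be sketched as below. This is an illustration under assumptions: the distribution names in `EXPECTED_VERSIONS` are hypothetical placeholders, and how the advertised and registered tool names are collected is left to the caller. Only `importlib.metadata` from the standard library is used.

```python
from importlib import metadata
from typing import Callable, Optional

# Hypothetical expectation table: distribution name -> expected version.
EXPECTED_VERSIONS = {"sin-sckg": "0.1.0", "sin-ibd": "0.1.0"}


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(
    expected: dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> list[str]:
    """List human-readable problems; an empty list means versions agree."""
    problems = []
    for name, want in sorted(expected.items()):
        got = lookup(name)
        if got is None:
            problems.append(f"{name}: not installed (expected {want})")
        elif got != want:
            problems.append(f"{name}: installed {got}, expected {want}")
    return problems


def tool_name_drift(advertised: set[str], registered: set[str]) -> dict[str, list[str]]:
    """Compare advertised MCP tool names with the ones actually registered."""
    return {
        "advertised_but_missing": sorted(advertised - registered),
        "registered_but_hidden": sorted(registered - advertised),
    }
```

A script built on these helpers would exit non-zero when `version_mismatches` returns anything or either drift list is non-empty, which matches the acceptance criterion above.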

### WS5 — Editable-install developer bootstrap
Add a documented one-command dev setup (the multi-repo `pip install -e` loop
from the Bundle README) as `scripts/dev_install.sh`, plus a matching
`scripts/run_all_tests.sh` that iterates the repos and aggregates results.

Acceptance:
- A fresh clone of all repos can be set up and fully tested with two commands.
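
The aggregation step of `scripts/run_all_tests.sh` is sketched here in Python only to show the logic; the plan itself calls for a shell script. The function names are hypothetical, and the repo list is passed in by the caller rather than discovered.

```python
import subprocess
import sys
from pathlib import Path


def run_in_each(repos: list[Path], command: list[str]) -> dict[str, int]:
    """Run the same command in every repo and record each exit code."""
    results = {}
    for repo in repos:
        completed = subprocess.run(command, cwd=repo)
        results[repo.name] = completed.returncode
    return results


def summarize(results: dict[str, int]) -> int:
    """Print one line per repo and return an overall exit status (0 or 1)."""
    failed = sorted(name for name, code in results.items() if code != 0)
    for name in sorted(results):
        print(f"{'FAIL' if name in failed else 'ok  '} {name}")
    return 1 if failed else 0
```

A caller might invoke `run_in_each(repos, [sys.executable, "-m", "pytest", "-q"])` and pass the result of `summarize` to `sys.exit`, so that one failing repo fails the whole run without hiding the others.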

## Sequencing

WS1 and WS2 are independent and can land first (per repo, in parallel).
WS3 depends on WS1 being green. WS4 and WS5 live only in the Bundle and depend
on WS1.

## Out of scope

- Any new runtime feature or tool.
- The larger architectural items tracked in `docs/plans/sota-roadmap.md`.
103 changes: 103 additions & 0 deletions docs/plans/sota-roadmap.md
@@ -0,0 +1,103 @@
# Plan: SOTA Roadmap

Status: proposed
Owner: unassigned
Scope: cross-repo, medium-term

## Motivation

The stack currently leans heavily on static analysis. The highest-leverage
improvements for real-world agent coding quality lie in execution feedback,
verification oracles, budget-aware context selection, and security. This plan
captures those larger items as discrete, independently shippable workstreams so
they can be prioritized and assigned over time.

Each workstream is intentionally coarse; it will be broken into smaller issues
when picked up.

## Workstreams

### WS1 — Compiler/LSP as a primary correctness oracle (SCKG + Oracle)
Today the Verification-Oracle shells out to linters and runs tests. The
strongest, cheapest correctness signal — the compiler/type checker — is only
partially used, and SCKG re-implements navigation that Language Servers already
provide.
- Integrate a Language Server (pyright/tsserver/gopls) behind a stable adapter.
- Use LSP diagnostics as a first-class oracle signal.
- Let SCKG consume LSP `definition`/`references` instead of hand-rolled call
resolution where available.

Value: high. Risk: medium (LSP lifecycle management).
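
To make "diagnostics as a first-class oracle signal" concrete, here is a minimal sketch of reducing LSP diagnostics to a pass/fail signal. The severity codes (1 = Error, 2 = Warning) come from the Language Server Protocol; `OracleSignal` and `signal_from_diagnostics` are hypothetical names, and treating a diagnostic without a severity as an error is a conservative assumption, since the protocol leaves that interpretation to the client.

```python
from dataclasses import dataclass

# Severity codes defined by the Language Server Protocol.
LSP_ERROR, LSP_WARNING = 1, 2


@dataclass(frozen=True)
class OracleSignal:
    passed: bool
    errors: int
    warnings: int


def signal_from_diagnostics(diagnostics: list[dict]) -> OracleSignal:
    """Collapse a list of LSP Diagnostic objects into one oracle signal."""
    # Assumption: a diagnostic with no severity is counted as an error.
    errors = sum(1 for d in diagnostics if d.get("severity", LSP_ERROR) == LSP_ERROR)
    warnings = sum(1 for d in diagnostics if d.get("severity") == LSP_WARNING)
    return OracleSignal(passed=errors == 0, errors=errors, warnings=warnings)
```

Keeping the adapter output this small would let the Oracle combine it with lint and test results without knowing which language server produced it.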

### WS2 — Budget-aware context selection (SCKG)
SCKG can build the graph and answer `impact`/`downstream`, but cannot answer
"what is the minimal context this task needs under this token budget?".
- Add a relevance ranking (graph centrality + recency + task affinity).
- Add a `select-context --budget N` command returning a ranked, truncated set.

Value: high. Risk: medium.
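
One possible shape for `select-context --budget N` is a weighted score followed by greedy packing. This is a sketch under assumptions: the `Candidate` fields, the weights, and the greedy strategy are all illustrative, and none of them is an existing SCKG API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    path: str
    tokens: int        # estimated token cost of including this item
    centrality: float  # graph centrality, normalized to 0..1
    recency: float     # how recently the item changed, 0..1
    affinity: float    # similarity to the task description, 0..1


def score(c: Candidate, weights: tuple[float, float, float] = (0.4, 0.2, 0.4)) -> float:
    """Weighted relevance; the weights are placeholders to be tuned."""
    w_cent, w_rec, w_aff = weights
    return w_cent * c.centrality + w_rec * c.recency + w_aff * c.affinity


def select_context(candidates: list[Candidate], budget: int) -> list[Candidate]:
    """Greedily take the highest-scoring items that still fit the budget."""
    chosen = []
    remaining = budget
    for c in sorted(candidates, key=score, reverse=True):
        if c.tokens <= remaining:
            chosen.append(c)
            remaining -= c.tokens
    return chosen
```

Greedy packing is not optimal in general (it is a knapsack heuristic), but it is cheap and keeps the output ranked, which is what the command is meant to return.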

### WS3 — Behavioral trace diffing (IBD + Oracle)
IBD diffs ASTs; it does not show what *behavior* changed. The Oracle already
has a trace-diff primitive — connect them.
- Capture execution traces (or structured logs) before/after a change.
- Produce a behavioral diff alongside the AST/intent diff.

Value: high. Risk: medium-high (trace capture is environment-specific).

### WS4 — Security scanning (ADW or new module)
No SAST, secret detection, dependency-vuln (SCA), or license checks exist.
- Integrate secret scanning, dependency vulnerability scanning, and a SAST pass.
- Surface findings through ADW's debt/score reporting and the Bundle CLI.

Value: high (production blocker). Risk: low-medium.

### WS5 — Eval harness expansion (Oracle)
The Oracle has a SWE-bench-style harness skeleton. Make it the meta-tool that
proves the rest of the stack helps.
- Add a curated task suite + scoring.
- Add regression evals runnable in CI.
- Report pass-rate deltas between versions.

Value: high (without this, "SOTA" is unmeasurable). Risk: medium.
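
Reporting pass-rate deltas between versions could look roughly like this. The sketch assumes each eval run is available as a mapping from task id to pass/fail; the function names are hypothetical and the comparison is restricted to tasks present in both runs.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two runs on the tasks they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    base = {t: baseline[t] for t in shared}
    cand = {t: candidate[t] for t in shared}
    return {
        "delta": pass_rate(cand) - pass_rate(base),
        "regressions": [t for t in shared if base[t] and not cand[t]],
        "fixes": [t for t in shared if not base[t] and cand[t]],
    }
```

Listing regressions separately from the aggregate delta matters for the CI use case: a run can improve overall while still breaking a task that used to pass.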

### WS6 — Incremental graph updates (SCKG)
The graph is rebuilt from scratch each time, which will not scale.
- Add file-watcher / changed-files incremental updates.
- Persist and invalidate per-file subgraphs.

Value: medium-high. Risk: medium.
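
The changed-files path can be sketched with content fingerprints: store a hash per file, then reparse only files whose hash differs and drop subgraphs for files that disappeared. This is a minimal illustration using SHA-256 from the standard library; `plan_update` and its return shape are assumptions, and real invalidation would also need to handle dependents of a changed file.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable content hash used to detect changed files."""
    return hashlib.sha256(content).hexdigest()


def plan_update(
    previous: dict[str, str], current_files: dict[str, bytes]
) -> tuple[dict[str, list[str]], dict[str, str]]:
    """Return (plan, new_fingerprints) for an incremental graph update."""
    current = {path: fingerprint(data) for path, data in current_files.items()}
    plan = {
        # New or modified files need their per-file subgraph rebuilt.
        "reparse": sorted(p for p, h in current.items() if previous.get(p) != h),
        # Files that no longer exist need their subgraph removed.
        "drop": sorted(p for p in previous if p not in current),
    }
    return plan, current
```

Persisting the returned fingerprints alongside the per-file subgraphs gives the next run its `previous` mapping.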

### WS7 — Polyglot parity (SCKG + IBD)
Parsing/diffing is Python-centric. Real repos are polyglot.
- Bring JS/TS (and one more language) to parity in both SCKG and IBD.
- Share a language-capability matrix in docs.

Value: medium. Risk: medium.

### WS8 — Persistent cross-task learning (Bundle)
Nothing remembers which solutions/patterns worked or failed.
- Add a lightweight store of task outcomes keyed by repo + task signature.
- Expose retrieval as an MCP tool the agent can consult.

Value: medium. Risk: medium-high (design-heavy).
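
A lightweight outcome store keyed by repo and task signature could start as a single SQLite table. The schema, column names, and helper functions below are hypothetical, and how a "task signature" is computed is left open, as in the plan.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the outcome store; schema is an illustrative guess."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS outcomes ("
        "repo TEXT NOT NULL, task_signature TEXT NOT NULL, "
        "approach TEXT NOT NULL, succeeded INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_outcomes_key "
        "ON outcomes (repo, task_signature)"
    )
    return conn


def record(conn: sqlite3.Connection, repo: str, signature: str,
           approach: str, succeeded: bool) -> None:
    """Append one task outcome."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO outcomes VALUES (?, ?, ?, ?)",
            (repo, signature, approach, int(succeeded)),
        )


def recall(conn: sqlite3.Connection, repo: str, signature: str) -> list[tuple[str, bool]]:
    """Return past (approach, succeeded) pairs for this repo and task, oldest first."""
    rows = conn.execute(
        "SELECT approach, succeeded FROM outcomes "
        "WHERE repo = ? AND task_signature = ? ORDER BY rowid",
        (repo, signature),
    ).fetchall()
    return [(approach, bool(ok)) for approach, ok in rows]
```

An MCP retrieval tool would then be a thin wrapper around `recall`, returning both failed and successful approaches so the agent can avoid repeating the former.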

### WS9 — Semantic merge for parallel agents (new)
When multiple agents work in parallel, naive merges conflict.
- Prototype conflict-aware merging at the symbol level (reuse IBD's AST diff).

Value: medium. Risk: high (research-y).
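
As a starting point for the prototype, symbol-level merging can be reduced to a three-way comparison over per-symbol source text. This sketch assumes each agent's change set is already available as a mapping from symbol name to new source (for example, derived from IBD's AST diff); it does not handle deletions or renames, and all names are hypothetical.

```python
def merge_symbol_edits(
    base: dict[str, str], ours: dict[str, str], theirs: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Merge two agents' symbol edits; return (merged, conflicting_symbols).

    `ours` and `theirs` contain only the symbols each agent changed or added.
    """
    merged = dict(base)
    conflicts = []
    for symbol in sorted(set(ours) | set(theirs)):
        a, b = ours.get(symbol), theirs.get(symbol)
        if a is not None and b is not None and a != b:
            # Both agents rewrote the same symbol differently: needs review.
            conflicts.append(symbol)
        else:
            merged[symbol] = a if a is not None else b
    return merged, conflicts
```

Edits to different symbols in the same file merge cleanly here, which is the case where line-based merges most often report a conflict unnecessarily.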

## Prioritization (suggested)

1. WS4 Security (production blocker, low risk)
2. WS5 Eval harness (makes everything else measurable)
3. WS1 LSP/compiler oracle (highest correctness leverage)
4. WS2 Budget-aware context
5. WS3 Behavioral trace diff
6. WS6 / WS7 (scaling & breadth)
7. WS8 / WS9 (research-heavy)

## Out of scope

- Operational/CI work — tracked in `docs/plans/operational-hardening.md`.