OpenSIN-Code · Delqhi · May 30, 2026
diff --git a/docs/plans/operational-hardening.md b/docs/plans/operational-hardening.md
@@ -0,0 +1,84 @@
+# Plan: Operational Hardening
+
+Status: proposed
+Owner: unassigned
+Scope: all 7 SIN-Code repositories (SCKG, IBD, POC, EFSM, ADW, Verification-Oracle, Bundle)
+
+## Motivation
+
+The stack is functionally complete and every repo has a green local `pytest`
+run, but it is not yet operationally production-ready. There is no automated
+verification on push, no release process, and no guard against the repos
+drifting out of sync. This plan covers the concrete, near-term engineering work
+needed to make the stack trustworthy to install and contribute to.
+
+This plan deliberately contains **no new features** — only CI, release,
+packaging, and consistency work.
+
+## Workstreams
+
+### WS1 — Continuous Integration (per repo)
+Add a GitHub Actions workflow (`.github/workflows/ci.yml`) to every repo that:
+- runs on `push` and `pull_request`
+- sets up a matrix of Python 3.11 / 3.12 / 3.13
+- installs the package with `pip install -e .`
+- runs `pytest -q`
+- where the repo declares optional extras, installs them and re-runs `pytest -q`
+
+Acceptance:
+- A green check is required on every PR before merge.
+- A failing test blocks the merge.
+
+### WS2 — Lint & format gate
+Adopt `ruff` (lint + format) across all repos with a single shared config.
+- Add `ruff` to a `dev` group under `[project.optional-dependencies]`.
+- Add a `lint` job to CI: `ruff check .` and `ruff format --check .`.
+- Fix existing violations in one mechanical commit per repo.
+
+Acceptance:
+- `ruff check .` and `ruff format --check .` are clean on every repo.
+
+### WS3 — Release & packaging
+- Add a `release.yml` workflow triggered on tag `v*` that builds an sdist +
+  wheel (`python -m build`) and attaches them to a GitHub Release.
+- Verify each `pyproject.toml` has correct metadata: `license`, `authors`,
+  `readme`, `classifiers`, `urls` (Homepage/Repository/Issues).
+- Decide and document a versioning policy (SemVer, already noted in CHANGELOGs).
+
+Acceptance:
+- Pushing a tag produces a downloadable wheel + sdist per repo.
+- `pip install <wheel>` works in a clean environment.
+
+### WS4 — Cross-repo consistency check
+The Bundle depends on the 6 subsystems via local path installs. Add a small
+script (`scripts/check_consistency.py` in the Bundle) that asserts:
+- every subsystem package version matches the Bundle's expectation
+- every subsystem exposes the CLI entry point the Bundle's `status` command
+  probes for
+- the MCP tool names advertised by `sin mcp-config` match the tools actually
+  registered in each `mcp_server.py`
+
+Wire it into the Bundle's CI as a non-blocking (warning) job first, then
+promote to blocking once green.
+
+Acceptance:
+- `python scripts/check_consistency.py` exits 0 against the current repos.
+
+### WS5 — Editable-install developer bootstrap
+Add a documented one-command dev setup (the multi-repo `pip install -e` loop
+from the Bundle README) as `scripts/dev_install.sh`, plus a matching
+`scripts/run_all_tests.sh` that iterates the repos and aggregates results.
+
+Acceptance:
+- A fresh clone of all repos can be set up and fully tested with two commands.
+
+## Sequencing
+
+WS1 and WS2 are independent and can land first (per repo, in parallel).
+WS3 depends on WS1 being green. WS4 and WS5 live only in the Bundle and depend
+on WS1.
+
+## Out of scope
+
+- Any new runtime feature or tool.
+- The larger architectural items tracked in `docs/plans/sota-roadmap.md`.
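Review note on WS4 of the operational-hardening plan: the first two assertions (package versions and CLI entry points) could be sketched roughly as below. This is a minimal sketch, not the planned script. It assumes the subsystems are installed as regular distributions visible to `importlib.metadata`. The function names (`check_versions`, `check_entry_points`, `installed_console_scripts`) are hypothetical, as are any package or script names passed to them. The MCP tool-name check is left out because it depends on how each `mcp_server.py` registers its tools.

```python
from importlib import metadata


def check_versions(expected, get_version=metadata.version):
    """Compare installed distribution versions against the expected ones.

    `expected` maps a distribution name to the version string the Bundle
    expects. Returns a list of problem descriptions; empty means consistent.
    `get_version` is injectable so the check can be exercised without
    installing anything.
    """
    problems = []
    for package, want in expected.items():
        try:
            have = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed (expected {want})")
            continue
        if have != want:
            problems.append(f"{package}: installed {have}, expected {want}")
    return problems


def installed_console_scripts():
    """Names of all console scripts registered in the current environment."""
    return {ep.name for ep in metadata.entry_points(group="console_scripts")}


def check_entry_points(expected_scripts, installed_scripts):
    """Report every expected console script that is not installed."""
    missing = sorted(set(expected_scripts) - set(installed_scripts))
    return [f"console script missing: {name}" for name in missing]
```

A script built this way would collect both lists and exit non-zero when either is non-empty, which gives the "exits 0 against the current repos" acceptance check a concrete meaning.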
diff --git a/docs/plans/sota-roadmap.md b/docs/plans/sota-roadmap.md
@@ -0,0 +1,103 @@
+# Plan: SOTA Roadmap
+
+Status: proposed
+Owner: unassigned
+Scope: cross-repo, medium-term
+
+## Motivation
+
+The stack currently leans heavily on static analysis. The highest-leverage
+improvements for real-world agent coding quality lie in execution feedback,
+verification oracles, budget-aware context selection, and security. This plan
+captures those larger items as discrete, independently shippable workstreams so
+they can be prioritized and assigned over time.
+
+Each workstream is intentionally coarse; it will be broken into smaller issues
+when picked up.
+
+## Workstreams
+
+### WS1 — Compiler/LSP as a primary correctness oracle (SCKG + Oracle)
+Today the Verification-Oracle shells out to linters and runs tests. The
+strongest, cheapest correctness signal — the compiler/type checker — is only
+partially used, and SCKG re-implements navigation that Language Servers already
+provide.
+- Integrate a Language Server (pyright/tsserver/gopls) behind a stable adapter.
+- Use LSP diagnostics as a first-class oracle signal.
+- Let SCKG consume LSP `definition`/`references` instead of hand-rolled call
+  resolution where available.
+
+Value: high. Risk: medium (LSP lifecycle management).
+
+### WS2 — Budget-aware context selection (SCKG)
+SCKG can build the graph and answer `impact`/`downstream`, but cannot answer
+"what is the minimal context this task needs under this token budget?".
+- Add a relevance ranking (graph centrality + recency + task affinity).
+- Add a `select-context --budget N` command returning a ranked, truncated set.
+
+Value: high. Risk: medium.
+
+### WS3 — Behavioral trace diffing (IBD + Oracle)
+IBD diffs ASTs; it does not show what *behavior* changed. The Oracle already
+has a trace-diff primitive — connect them.
+- Capture execution traces (or structured logs) before/after a change.
+- Produce a behavioral diff alongside the AST/intent diff.
+
+Value: high. Risk: medium-high (trace capture is environment-specific).
+
+### WS4 — Security scanning (ADW or new module)
+No SAST, secret detection, dependency-vuln (SCA), or license checks exist.
+- Integrate secret scanning, dependency vulnerability scanning, and a SAST pass.
+- Surface findings through ADW's debt/score reporting and the Bundle CLI.
+
+Value: high (production blocker). Risk: low-medium.
+
+### WS5 — Eval harness expansion (Oracle)
+The Oracle has a SWE-bench-style harness skeleton. Make it the meta-tool that
+proves the rest of the stack helps.
+- Add a curated task suite + scoring.
+- Add regression evals runnable in CI.
+- Report pass-rate deltas between versions.
+
+Value: high (without this, "SOTA" is unmeasurable). Risk: medium.
+
+### WS6 — Incremental graph updates (SCKG)
+The graph is rebuilt from scratch each time, which will not scale.
+- Add file-watcher / changed-files incremental updates.
+- Persist and invalidate per-file subgraphs.
+
+Value: medium-high. Risk: medium.
+
+### WS7 — Polyglot parity (SCKG + IBD)
+Parsing/diffing is Python-centric. Real repos are polyglot.
+- Bring JS/TS (and one more language) to parity in both SCKG and IBD.
+- Share a language-capability matrix in docs.
+
+Value: medium. Risk: medium.
+
+### WS8 — Persistent cross-task learning (Bundle)
+No component records which solutions or patterns worked or failed.
+- Add a lightweight store of task outcomes keyed by repo + task signature.
+- Expose retrieval as an MCP tool the agent can consult.
+
+Value: medium. Risk: medium-high (design-heavy).
+
+### WS9 — Semantic merge for parallel agents (new)
+When multiple agents work in parallel, naive merges conflict.
+- Prototype conflict-aware merging at the symbol level (reuse IBD's AST diff).
+
+Value: medium. Risk: high (research-y).
+
+## Prioritization (suggested)
+
+1. WS4 Security (production blocker, low risk)
+2. WS5 Eval harness (makes everything else measurable)
+3. WS1 LSP/compiler oracle (highest correctness leverage)
+4. WS2 Budget-aware context
+5. WS3 Behavioral trace diff
+6. WS6 / WS7 (scaling & breadth)
+7. WS8 / WS9 (research-heavy)
+
+## Out of scope
+
+- Operational/CI work — tracked in `docs/plans/operational-hardening.md`.
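Review note on WS2 of the SOTA roadmap: one way to read "ranked, truncated set" is a greedy selection over scored candidates, sketched below. This is only an illustration under stated assumptions. The `Candidate` fields, the weights, and the function names are hypothetical; the three signals are assumed to be pre-normalized to the range 0 to 1; and greedy packing is a simple stand-in, not an optimal knapsack solution and not necessarily what SCKG will implement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A unit of context (file or symbol) with its cost and ranking signals."""
    path: str
    tokens: int        # estimated token cost of including this candidate
    centrality: float  # graph centrality, assumed normalized to [0, 1]
    recency: float     # how recently it changed, assumed normalized to [0, 1]
    affinity: float    # relevance to the task, assumed normalized to [0, 1]


def score(candidate, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three signals; the weights are placeholders."""
    w_centrality, w_recency, w_affinity = weights
    return (
        w_centrality * candidate.centrality
        + w_recency * candidate.recency
        + w_affinity * candidate.affinity
    )


def select_context(candidates, budget):
    """Greedily pick the highest-scoring candidates that fit the budget.

    Candidates are visited in descending score order (ties broken by path)
    and skipped when they would exceed the remaining token budget.
    """
    ranked = sorted(candidates, key=lambda c: (-score(c), c.path))
    chosen, used = [], 0
    for candidate in ranked:
        if used + candidate.tokens <= budget:
            chosen.append(candidate)
            used += candidate.tokens
    return chosen
```

Skipping oversized candidates rather than stopping at the first one that does not fit lets smaller, lower-ranked items fill the remaining budget, at the cost of the result no longer being a strict prefix of the ranking.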