Summary
Promote CLDK from "the call graph + a few accessors" to a first-class heterogeneous multi-graph substrate: every relation in a codebase modelled as a typed graph, all sharing a node namespace, joinable against each other, and followable across language boundaries through explicit bridge edges (HTTP routes, RPC service definitions, message-bus topics, FFI declarations, ORM ↔ SQL, config-file references).
This is the substrate that the composable chain API in #155 queries against. #155 proposed the surface; this issue proposes the data model underneath that makes the chains meaningful for any non-trivial relation.
The graphs
Today CLDK exposes essentially one graph (call) plus inventory accessors from which other relations can be derived ad hoc. Each of the following should be a first-class graph with its own `pa.<graph_name>()` accessor:
| Graph | Nodes | Edges | Today |
| --- | --- | --- | --- |
| Call | callables | "A calls B" | exists (`get_call_graph`) |
| Inheritance | classes/interfaces | "A extends/implements B" | derivable, not first-class |
| Type-use | types, callables, fields | "callable C declares/returns/accepts type T"; "field F has type T" | not exposed |
| Module-import | modules/packages | "module M imports module N" (with symbol granularity where the language has it) | partial via `get_imports` |
| Field read/write | fields/globals × callables | "callable C reads field F" / "writes field F" | not exposed |
| Decorator / annotation | callables/classes × decorators | "C is decorated D with kwargs K" | partial via `.decorators` attribute |
| Exception-flow | callables × exception types | "C raises E"; "C catches E" | not exposed |
| Configuration-reference | code symbols × config keys | "callable C reads config key K"; "key K defined in file F" | not exposed |
| Test-link | tests × prod callables | "test T exercises callable C" (mined from imports + coverage where available) | not exposed |
| Resource graph | I/O sites × resources | "callable C opens file/socket/db connection X" | not exposed |
Every graph shares the same node namespace wherever possible, so a callable referenced in the call graph IS the same node referenced in the type-use graph, the field-rw graph, the decorator graph, etc. This is the whole point — it makes joins natural.
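A minimal sketch of what a shared node namespace could look like, written in plain Python rather than against any existing CLDK API. The `MultiGraph` class, the edge-kind names, and the `module.Class.method` id format are all hypothetical; the only point illustrated is that every typed graph keys its edges by the same node ids.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class MultiGraph:
    """Hypothetical container: one edge set per graph kind, one shared node-id space."""

    # kind -> set of (source_id, target_id) pairs
    edges: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, kind: str, src: str, dst: str) -> None:
        self.edges[kind].add((src, dst))

    def successors(self, kind: str, node: str) -> set:
        """Targets of `node` within a single typed graph."""
        return {dst for src, dst in self.edges[kind] if src == node}

    def kinds_touching(self, node: str) -> set:
        """Graph kinds in which `node` appears as a source or a target."""
        return {
            kind
            for kind, pairs in self.edges.items()
            if any(node in pair for pair in pairs)
        }


g = MultiGraph()
# The node-id format is an assumption made for illustration.
g.add("call", "app.UserService.save", "app.Crypto.encrypt")
g.add("field_write", "app.UserService.save", "app.User.password")
g.add("type_use", "app.UserService.save", "app.User")

# The same callable id is usable in every graph without translation.
print(sorted(g.kinds_touching("app.UserService.save")))
# ['call', 'field_write', 'type_use']
```

Because the id `app.UserService.save` means the same thing in all three edge sets, a join is just a lookup on that id.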
Joins across graphs (homogeneous, within one language)
The killer queries live in the joins. Examples that should be one expression each:
- "Every public-API method (decorator graph) that calls a deprecated method (call graph) whose return type is a database model class (type-use graph)": refactoring impact
- "Every method that writes field `password` (field-rw graph) but does not catch `EncryptionError` (exception-flow graph)": security gap
- "Every controller class (inheritance graph) whose method reads config key `enable_legacy_*` (config-ref graph) but has no test coverage (test-link graph)": risk-prioritized review
- "Every module (import graph) that depends on package `foo` AND exports a class implementing interface `Bar` (inheritance graph)": license / supply-chain audit
Joins like this are how you turn "I have eight graphs" into "I have a code-analysis substrate." The chain API in #155 is the surface that makes them ergonomic; without the multi-graph substrate the chain is stuck on calls.
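As a rough illustration of why shared ids make these joins cheap, the security-gap query above reduces to a set difference. The edge sets, callable ids, and helper names here are invented for the example and are not CLDK output.

```python
# Hypothetical edge sets from two graphs, keyed by the same callable ids.
field_writes = {
    ("auth.reset_password", "User.password"),
    ("auth.register", "User.password"),
    ("profile.update_bio", "User.bio"),
}
catches = {
    ("auth.register", "EncryptionError"),
    ("profile.update_bio", "ValueError"),
}


def writers_of(field_name: str) -> set:
    """Callables with a field-write edge to `field_name`."""
    return {callable_id for callable_id, f in field_writes if f == field_name}


def catchers_of(exc: str) -> set:
    """Callables with a catches edge to exception type `exc`."""
    return {callable_id for callable_id, e in catches if e == exc}


# The join is a set difference on the shared callable ids.
gap = writers_of("User.password") - catchers_of("EncryptionError")
print(sorted(gap))  # ['auth.reset_password']
```

No id translation step appears anywhere in the query, which is what the shared namespace buys.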
Cross-language: the bridge edges (the real prize)
Real codebases are polyglot. The Odoo audit I just did had Python controllers, XML view definitions, JS frontend hooks, SQL access lists, and YAML config — and the most interesting trust boundaries were between languages, not within them. The current state of the art is the analyst stitching these together in their head.
A cross-language multi-graph substrate models the bridging edges explicitly:
| Bridge | Source side | Target side | Linking signal |
| --- | --- | --- | --- |
| HTTP route | server handler decorated/declared with URL path | client `fetch`/`axios`/`requests` call with URL string | URL string match (or schema like OpenAPI/Swagger when present) |
| RPC service | server-side service method (gRPC/Thrift/SOAP) | client stub call | service+method name from `.proto` / IDL |
| Message bus | publisher (Kafka/SQS/RabbitMQ/NATS) | subscriber/handler | topic/queue name |
| ORM ↔ SQL | ORM model class + field declarations | SQL DDL / migration / hand-written query | table+column names |
| FFI / shared library | native declaration | bindings on the other side | symbol name + ABI |
| Config file ↔ code | code site reading a key | YAML/JSON/TOML/INI/ENV file defining the key | key path |
| Template ↔ code | template variable | view/handler passing it | variable name in scope |
| Build artifact | source files of one component | declared input/output of build step | filename / target name in Makefile/Bazel/etc. |
| Container ↔ binary | Dockerfile `CMD`/`ENTRYPOINT` | program entry point | path / image layer |
| Schema ↔ deserialiser | JSON Schema / Protobuf / Avro / OpenAPI | parse call site | schema reference / mime type |
The substrate should:
- Detect these bridges automatically using framework recipes (Flask routes, FastAPI, Express, Spring, gRPC `.proto`, OpenAPI specs, common ORM patterns, common message-bus client libs).
- Allow user-declared bridges for project-specific conventions.
- Expose them as typed edges in the multi-graph so existing chain queries (`reachable_to`, `callers`, `callees`) traverse them transparently when the user opts in (`via=["call", "http", "rpc"]`).
- Carry a confidence label per bridge edge (see next section).
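One way to picture the opt-in traversal is a breadth-first search that only follows edges whose kind is listed in `via`. This is a sketch under stated assumptions: the `EDGES` list, the node ids, and the `reachable_from` helper are hypothetical and do not reproduce the signature of the chain API proposed in #155.

```python
from collections import deque

# (source, target, edge_kind); ids and kinds are illustrative only.
EDGES = [
    ("web/app.js:loadUser", "api.users.get_user", "http"),
    ("api.users.get_user", "db.users.fetch", "call"),
    ("db.users.fetch", "audit.AuditService.Log", "rpc"),
]


def reachable_from(start, via=("call",)):
    """Nodes reachable from `start` following only edges whose kind is in `via`."""
    allowed = set(via)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for src, dst, kind in EDGES:
            if src == node and kind in allowed and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    seen.discard(start)
    return seen


# Default: call edges only, so the JS call site reaches nothing.
print(reachable_from("web/app.js:loadUser"))
# Opting in to bridge edges extends the same query across languages.
print(sorted(reachable_from("web/app.js:loadUser", via=["call", "http", "rpc"])))
```

Keeping bridge edges off by default means existing call-graph queries return the same results until the user asks for more.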
Honest visibility: heuristic linking must be visible
Cross-language linking is mostly heuristic — a URL string in a JS file is matched to a Python route by string equality and parameter-shape compatibility, not by static proof. The substrate must not pretend otherwise.
Every bridge edge carries:
- `bridge_type`: `http_route` / `rpc_service` / `message_topic` / `orm_table` / `ffi_symbol` / `config_key` / ...
- `confidence`: `static_proof` (e.g. resolved via a typed schema like OpenAPI/protobuf), `string_match` (URL or topic literal matched), `heuristic` (name similarity), `manual` (user-declared)
- `evidence`: the literal/schema reference that produced the link
- `direction`: who sends, who receives
Same visibility model as the within-language graphs (resolved / structural / unresolved from #155), extended across the language boundary. The analyst's confidence tier follows mechanically from the weakest link in the chain.
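The edge record and the weakest-link rule might be sketched as follows. The field names come from the list above; the `Confidence` ordering (in particular where `manual` sits relative to the other tiers) is an assumption made for the example, as are the sample ids and evidence strings.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    # Ordering is an assumption: a higher value means stronger evidence.
    HEURISTIC = 1
    STRING_MATCH = 2
    MANUAL = 3
    STATIC_PROOF = 4


@dataclass(frozen=True)
class BridgeEdge:
    source: str             # sending side (direction runs source -> target)
    target: str             # receiving side
    bridge_type: str        # e.g. "http_route", "message_topic"
    confidence: Confidence
    evidence: str           # literal or schema reference that produced the link


def chain_confidence(chain):
    """Weakest-link rule: a chain is only as certain as its least certain edge."""
    return min(edge.confidence for edge in chain)


chain = [
    BridgeEdge("web/app.js:loadUser", "api.users.get_user", "http_route",
               Confidence.STRING_MATCH, "'/api/user'"),
    BridgeEdge("api.users.get_user", "audit.consume", "message_topic",
               Confidence.HEURISTIC, "topic name resembles 'user-events'"),
]
print(chain_confidence(chain).name)  # HEURISTIC
```

Storing the tier as an ordered value lets the per-chain label fall out of a single `min`, with the `evidence` strings available for the analyst to inspect.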
Why this matters more for security than for refactoring
For refactoring, within-language graphs already cover ~80% of the value because most refactors are within one component. The polyglot story is nice-to-have.
For security audits, polyglot is the prize. The real attack surface in a modern app is the seams: untrusted JSON crossing from JS to Python, a topic name shared between two services, a config file flag that disables auth on a route. Today the analyst stitches these seams from grep + memory; a substrate that exposes them as typed graph edges, joinable with the call/type/field graphs on either side, would be transformative.
Concretely, the Odoo audit I just published would extend to:
- Cross-reference each Python controller route against the XML view definitions that declare which actions/clients invoke them
- Cross-reference each `@http.route` against the JS frontend to see which routes are actually called from the SPA and which are orphans
- Cross-reference each model field against the `ir.model.access.csv` ACL and `ir.rule` record rules to compute the actual effective permission on the field for each user role
None of that is doable today. All of it is one query each on top of a multi-graph substrate.
Relationship to #155
- #155 is the surface: composable chains over CLDK's graphs, such as `pa.callables().with_decorator(...).reachable_to(...).without_passing_through(...)`.
- This issue is the substrate: the typed, joinable graphs those chains query against.
They are complementary and roughly independent. Either can ship first; both together are what makes CLDK a general code-analysis substrate (the "pandas of code analysis" framing).
Suggested incremental rollout
The full vision is big; the incremental path is small:
1. Promote within-language graphs to first-class. Inheritance, type-use, module-import, field-rw, decorator, exception-flow each get a `pa.<graph_name>()` accessor returning a graph with a stable node-id scheme shared with the call graph. (Most of these are derivable from the existing analysis; the work is exposing them, not recomputing.)
2. Add cross-graph joins on shared node ids. No new analysis; just the surface so users can ask `pa.field_rw().writes_of("password").intersect(pa.exception_flow().not_catching("EncryptionError"))`.
3. Ship two cross-language bridge types as the proof of concept: HTTP route (server↔client) and ORM↔SQL. Each driven by a framework recipe (Flask/FastAPI/Django/Express + SQLAlchemy/Django-ORM/Sequelize). Honest confidence labels from day one.
4. Open a recipe registry so the community can contribute additional bridge detectors (gRPC, Kafka, NATS, Spring, Rails, etc.). Recipes ship as data, not code.
Step 1 alone is a big win and unlocks #155's chain API to operate on more than the call graph. Step 3 is what makes CLDK uniquely valuable for polyglot security audits.
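To make "recipes ship as data" concrete, here is one possible shape: a recipe is a plain dictionary of patterns, and a single generic function interprets it. The recipe schema, the `apply_recipe` helper, and the sample sources are all hypothetical, and regex matching stands in for whatever parser-backed matching a real detector would use.

```python
import re

# A hypothetical recipe expressed purely as data (it could equally live in YAML or JSON).
FLASK_ROUTE_RECIPE = {
    "bridge_type": "http_route",
    "server_pattern": r'''@app\.route\(\s*['"]([^'"]+)['"]''',
    "client_pattern": r'''fetch\(\s*['"]([^'"]+)['"]''',
    "confidence": "string_match",
}


def apply_recipe(recipe, server_src, client_src):
    """Generic engine: link server and client sites whose URL literals are equal."""
    server_urls = set(re.findall(recipe["server_pattern"], server_src))
    client_urls = set(re.findall(recipe["client_pattern"], client_src))
    return [
        {
            "bridge_type": recipe["bridge_type"],
            "confidence": recipe["confidence"],
            "evidence": url,
        }
        for url in sorted(server_urls & client_urls)
    ]


server = '''
@app.route("/api/users")
def list_users(): ...

@app.route("/api/admin")
def admin(): ...
'''
client = 'fetch("/api/users").then(r => r.json())'

edges = apply_recipe(FLASK_ROUTE_RECIPE, server, client)
print(edges)
```

Only `/api/users` produces an edge here; `/api/admin` has no matching client call and would show up as an orphan route. Adding a detector for another framework would mean contributing another dictionary, not another function.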
Out of scope (separate issues)
The framing
If CLDK is going to be the pandas of code analysis (the broader thesis from the discussion these issues came out of), this is the equivalent of pandas moving from "Series + DataFrame" to "DataFrame + MultiIndex + merge + groupby + cross-table joins." A single graph is a Series. The multi-graph substrate with shared node ids and cross-language bridges is the DataFrame join — which is where pandas went from useful to indispensable.
Context: same Odoo audit and poe-with-cldk skill methodology that motivated #155. Within one Python file, the call graph carried me most of the way; the next step up — auditing the seams between the Python controllers, the XML actions, the JS frontend, the ACL CSVs, and the SQL access patterns — runs into a wall today because there is no substrate that represents those relations as joinable typed graphs.