Heterogeneous multi-graph view: inheritance / type-use / import / field-rw graphs with joins, polyglot edges across languages #156

@rahlk

Summary

Promote CLDK from "the call graph + a few accessors" to a first-class heterogeneous multi-graph substrate: every relation in a codebase modelled as a typed graph, all sharing a node namespace, joinable against each other, and followable across language boundaries through explicit bridge edges (HTTP routes, RPC service definitions, message-bus topics, FFI declarations, ORM ↔ SQL, config-file references).

This is the substrate that the composable chain API in #155 queries against. #155 proposed the surface; this issue proposes the data model underneath that makes the chains meaningful for any non-trivial relation.

The graphs

Today CLDK exposes essentially one graph (call) plus inventory accessors from which other relations can be derived ad-hoc. Each of the following should be a first-class graph with its own pa.<graph_name>() accessor:

| Graph | Nodes | Edges | Today |
| --- | --- | --- | --- |
| Call | callables | "A calls B" | exists (get_call_graph) |
| Inheritance | classes/interfaces | "A extends/implements B" | derivable, not first-class |
| Type-use | types, callables, fields | "callable C declares/returns/accepts type T"; "field F has type T" | not exposed |
| Module-import | modules/packages | "module M imports module N" (with symbol granularity where the language has it) | partial via get_imports |
| Field read/write | fields/globals × callables | "callable C reads field F" / "writes field F" | not exposed |
| Decorator / annotation | callables/classes × decorators | "C is decorated D with kwargs K" | partial via .decorators attribute |
| Exception-flow | callables × exception types | "C raises E"; "C catches E" | not exposed |
| Configuration-reference | code symbols × config keys | "callable C reads config key K"; "key K defined in file F" | not exposed |
| Test-link | tests × prod callables | "test T exercises callable C" (mined from imports + coverage where available) | not exposed |
| Resource graph | I/O sites × resources | "callable C opens file/socket/db connection X" | not exposed |

Every graph shares the same node namespace wherever possible, so a callable referenced in the call graph IS the same node referenced in the type-use graph, the field-rw graph, the decorator graph, etc. This is the whole point — it makes joins natural.
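A minimal sketch of what a shared node namespace could look like. `NodeId`, `MultiGraph`, and the example qualified names are hypothetical and not part of CLDK today; the only point is that a node built from the same values resolves to the same entry in every graph.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeId:
    """Hypothetical stable node id shared by every graph."""
    kind: str            # "callable", "class", "field", ...
    qualified_name: str  # e.g. "app.models.User.set_password"


class MultiGraph:
    """Typed edges over one shared node namespace (illustrative only)."""

    def __init__(self):
        # graph name -> set of (source, target, label) triples
        self._edges = defaultdict(set)

    def add_edge(self, graph, src, dst, label=""):
        self._edges[graph].add((src, dst, label))

    def graphs_touching(self, node):
        """Names of every graph in which `node` appears as source or target."""
        return sorted(
            g for g, edges in self._edges.items()
            if any(node in (s, d) for s, d, _ in edges)
        )


mg = MultiGraph()
set_pw = NodeId("callable", "app.models.User.set_password")
hash_pw = NodeId("callable", "app.crypto.hash_password")
pw_field = NodeId("field", "app.models.User.password")

mg.add_edge("call", set_pw, hash_pw)
mg.add_edge("field_rw", set_pw, pw_field, label="writes")

# A second NodeId built from the same values is the same node.
same = NodeId("callable", "app.models.User.set_password")
print(mg.graphs_touching(same))  # ['call', 'field_rw']
```

Because the id is a frozen value object, equality and hashing come for free, which is what lets two independently built graphs agree on a node without a lookup table.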

Joins across graphs (homogeneous, within one language)

The killer queries live in the joins. Examples that should be one expression each:

  • "Every public-API method (decorator graph) that calls a deprecated method (call graph) whose return type is a database model class (type-use graph)" — refactoring impact
  • "Every method that writes field password (field-rw graph) but does not catch EncryptionError (exception-flow graph)" — security gap
  • "Every controller class (inheritance graph) whose method reads config key enable_legacy_* (config-ref graph) but has no test coverage (test-link graph)" — risk-prioritized review
  • "Every module (import graph) that depends on package foo AND exports a class implementing interface Bar (inheritance graph)" — license / supply-chain audit

Joins like this are how you turn "I have eight graphs" into "I have a code-analysis substrate." The chain API in #155 is the surface that makes them ergonomic; without the multi-graph substrate the chain is stuck on calls.
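To make the join idea concrete, here is a sketch of the "writes password but does not catch EncryptionError" query as plain set operations over shared callable ids. The edge sets and the helper names `writers_of` and `catchers_of` are made up for illustration; a real substrate would derive the edges from analysis.

```python
# Hypothetical edge lists keyed by the same callable ids in both graphs.
field_writes = {  # (callable, field)
    ("User.set_password", "password"),
    ("Admin.reset_password", "password"),
    ("User.rename", "name"),
}
catches = {  # (callable, exception type)
    ("User.set_password", "EncryptionError"),
    ("User.rename", "ValueError"),
}


def writers_of(field):
    """Callables with a write edge to `field` in the field-rw graph."""
    return {c for c, f in field_writes if f == field}


def catchers_of(exc):
    """Callables with a catch edge to `exc` in the exception-flow graph."""
    return {c for c, e in catches if e == exc}


# The join is a set difference because both graphs share node ids.
gap = writers_of("password") - catchers_of("EncryptionError")
print(sorted(gap))  # ['Admin.reset_password']
```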

Cross-language: the bridge edges (the real prize)

Real codebases are polyglot. The Odoo audit I just did had Python controllers, XML view definitions, JS frontend hooks, SQL access lists, and YAML config — and the most interesting trust boundaries were between languages, not within them. The current state of the art is the analyst stitching these together in their head.

A cross-language multi-graph substrate models the bridging edges explicitly:

| Bridge | Source side | Target side | Linking signal |
| --- | --- | --- | --- |
| HTTP route | server handler decorated/declared with URL path | client fetch/axios/requests call with URL string | URL string match (or schema like OpenAPI/Swagger when present) |
| RPC service | server-side service method (gRPC/Thrift/SOAP) | client stub call | service+method name from .proto / IDL |
| Message bus | publisher (Kafka/SQS/RabbitMQ/NATS) | subscriber/handler | topic/queue name |
| ORM ↔ SQL | ORM model class + field declarations | SQL DDL / migration / hand-written query | table+column names |
| FFI / shared library | native declaration | bindings on the other side | symbol name + ABI |
| Config file ↔ code | code site reading a key | YAML/JSON/TOML/INI/ENV file defining the key | key path |
| Template ↔ code | template variable | view/handler passing it | variable name in scope |
| Build artifact | source files of one component | declared input/output of build step | filename / target name in Makefile/Bazel/etc. |
| Container ↔ binary | Dockerfile CMD/ENTRYPOINT | program entry point | path / image layer |
| Schema ↔ deserialiser | JSON Schema / Protobuf / Avro / OpenAPI | parse call site | schema reference / mime type |

The substrate should:

  1. Detect these bridges automatically using framework recipes (Flask routes, FastAPI, Express, Spring, gRPC .proto, OpenAPI specs, common ORM patterns, common message-bus client libs).
  2. Allow user-declared bridges for project-specific conventions.
  3. Expose them as typed edges in the multi-graph so existing chain queries (reachable_to, callers, callees) traverse them transparently when the user opts in (via=["call", "http", "rpc"]).
  4. Carry a confidence label per bridge edge (see next section).
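A rough sketch of the opt-in traversal from point 3: reachability that follows only the edge types named in `via`. The function `reachable_from`, the edge list, and the node names are assumptions made for this example and do not reflect an existing CLDK signature.

```python
from collections import deque

# (source, target, edge type); node names are hypothetical.
EDGES = [
    ("js:loadCart", "py:cart_controller", "http"),
    ("py:cart_controller", "py:load_cart", "call"),
    ("py:load_cart", "sql:cart_items", "orm"),
    ("js:loadCart", "js:render", "call"),
]


def reachable_from(start, via=("call",)):
    """Breadth-first reachability that follows only the edge types in `via`."""
    allowed = set(via)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for src, dst, kind in EDGES:
            if src == node and kind in allowed and dst not in seen:
                seen.add(dst)
                queue.append(dst)
    seen.discard(start)
    return seen


# Default stays inside one language: only call edges are followed.
print(sorted(reachable_from("js:loadCart")))
# Opting in to "http" lets the same query cross into the Python side.
print(sorted(reachable_from("js:loadCart", via=["call", "http"])))
```

Keeping bridge types out of the default `via` means existing call-graph queries return the same results until the user explicitly asks to cross a boundary.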

Honest visibility: heuristic linking must be visible

Cross-language linking is mostly heuristic — a URL string in a JS file is matched to a Python route by string equality and parameter-shape compatibility, not by static proof. The substrate must not pretend otherwise.
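As an illustration of how weak this evidence is, here is a sketch of matching a client URL literal against a Flask-style route template. The helpers `route_to_regex` and `match_confidence` are hypothetical, and labelling a parameter-shape match as "heuristic" is an assumption of this sketch rather than something the proposal fixes.

```python
import re


def route_to_regex(template):
    """Turn a Flask-style template such as '/cart/<int:id>' into a regex."""
    parts = []
    for segment in template.strip("/").split("/"):
        m = re.fullmatch(r"<(?:(\w+):)?(\w+)>", segment)
        if m is None:
            parts.append(re.escape(segment))   # literal path segment
        elif m.group(1) == "int":
            parts.append(r"\d+")               # integer parameter
        else:
            parts.append(r"[^/]+")             # any other parameter
    return re.compile("/" + "/".join(parts) + "/?")


def match_confidence(client_url, route_template):
    """Return a confidence label for the link, or None if nothing matches."""
    if client_url == route_template:
        return "string_match"
    # Replace JS template-literal placeholders like ${cartId} with a probe value.
    probe = re.sub(r"\$\{[^}]*\}", "0", client_url)
    if route_to_regex(route_template).fullmatch(probe):
        return "heuristic"
    return None


print(match_confidence("/cart/list", "/cart/list"))
print(match_confidence("/cart/${cartId}", "/cart/<int:id>"))
print(match_confidence("/orders/${id}", "/cart/<int:id>"))
```

Nothing here proves that the JS call reaches the Python handler at runtime (base URLs, proxies, and string concatenation are all ignored), which is exactly why the label has to travel with the edge.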

Every bridge edge carries:

  • bridge_type: http_route / rpc_service / message_topic / orm_table / ffi_symbol / config_key / ...
  • confidence: static_proof (e.g. resolved via a typed schema like OpenAPI/protobuf), string_match (URL or topic literal matched), heuristic (name similarity), manual (user-declared)
  • evidence: the literal/schema reference that produced the link
  • direction: who sends, who receives

This is the same visibility model as the within-language graphs (resolved / structural / unresolved from #155), extended across the language boundary. The confidence tier an analyst can claim for a result follows mechanically from the weakest link in the chain.
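One way the "weakest link" rule could be computed, as a sketch. The `BridgeEdge` class and `chain_confidence` are hypothetical, direction is encoded here as source (sender) to target (receiver), and the ranking of the four labels is an assumption: the proposal names the labels but does not order them.

```python
from dataclasses import dataclass

# Assumed ordering from weakest to strongest; illustrative only.
CONFIDENCE_ORDER = ["heuristic", "string_match", "manual", "static_proof"]


@dataclass(frozen=True)
class BridgeEdge:
    source: str       # sender side
    target: str       # receiver side
    bridge_type: str  # http_route, rpc_service, message_topic, ...
    confidence: str   # one of CONFIDENCE_ORDER
    evidence: str     # the literal or schema reference behind the link


def chain_confidence(edges):
    """Weakest confidence label along a chain of bridge edges."""
    return min((e.confidence for e in edges), key=CONFIDENCE_ORDER.index)


chain = [
    BridgeEdge("js:checkout", "py:pay", "http_route", "string_match", "'/api/pay'"),
    BridgeEdge("py:pay", "svc:Ledger.Post", "rpc_service", "static_proof", "ledger.proto"),
]
print(chain_confidence(chain))  # string_match
```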

Why this matters more for security than for refactoring

For refactoring, within-language graphs already cover ~80% of the value because most refactors are within one component. The polyglot story is nice-to-have.

For security audits, polyglot is the prize. The real attack surface in a modern app is the seams: untrusted JSON crossing from JS to Python, a topic name shared between two services, a config file flag that disables auth on a route. Today the analyst stitches these seams from grep + memory; a substrate that exposes them as typed graph edges, joinable with the call/type/field graphs on either side, would be transformative.

Concretely, the Odoo audit I just published would extend to:

  • Cross-reference each Python controller route against the XML view definitions that declare which actions/clients invoke them
  • Cross-reference each @http.route against the JS frontend to see which routes are actually called from the SPA and which are orphans
  • Cross-reference each model field against the ir.model.access.csv ACL and ir.rule record rules to compute the actual effective permission on the field for each user role

None of that is doable today. All of it is one query each on top of a multi-graph substrate.

Relationship to #155

They are complementary and roughly independent. Either can ship first; both together are what makes CLDK a general code-analysis substrate (the "pandas of code analysis" framing).

Suggested incremental rollout

The full vision is big; the incremental path is small:

  1. Promote within-language graphs to first-class. Inheritance, type-use, module-import, field-rw, decorator, exception-flow each get a pa.<graph_name>() accessor returning a graph with a stable node-id scheme shared with the call graph. (Most of these are derivable from the existing analysis; the work is exposing them, not recomputing.)
  2. Add cross-graph joins on shared node ids. No new analysis; just the surface so users can ask pa.field_rw().writes_of("password").intersect(pa.exception_flow().not_catching("EncryptionError")).
  3. Ship two cross-language bridge types as the proof of concept: HTTP route (server↔client) and ORM↔SQL. Each driven by a framework recipe (Flask/FastAPI/Django/Express + SQLAlchemy/Django-ORM/Sequelize). Honest confidence labels from day one.
  4. Open a recipe registry so the community can contribute additional bridge detectors (gRPC, Kafka, NATS, Spring, Rails, etc.). Recipes ship as data, not code.

Step 1 alone is a big win and unlocks #155's chain API to operate on more than the call graph. Step 3 is what makes CLDK uniquely valuable for polyglot security audits.
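To illustrate "recipes ship as data, not code" from step 4, here is a sketch in which the recipe is a plain dictionary (it could equally live in JSON or YAML) and one generic engine applies it. The recipe schema, `apply_recipe`, and the regex-based detection are all assumptions of this example; a real detector would more likely work from the parsed symbol table than from raw text.

```python
import re

# A hypothetical recipe expressed purely as data.
FLASK_ROUTE_RECIPE = {
    "bridge_type": "http_route",
    "language": "python",
    "pattern": r"@\w+\.route\(\s*['\"](?P<path>[^'\"]+)['\"]",
    "confidence": "string_match",
}


def apply_recipe(recipe, source_text):
    """Generic engine: yields one bridge endpoint per pattern match."""
    for m in re.finditer(recipe["pattern"], source_text):
        yield {
            "bridge_type": recipe["bridge_type"],
            "evidence": m.group("path"),
            "confidence": recipe["confidence"],
        }


SOURCE = '''
@app.route("/cart/<int:id>")
def show_cart(id): ...

@app.route('/health')
def health(): ...
'''

endpoints = list(apply_recipe(FLASK_ROUTE_RECIPE, SOURCE))
print([e["evidence"] for e in endpoints])  # ['/cart/<int:id>', '/health']
```

Since the recipe carries no executable code, a community-contributed detector can be reviewed as a diff of a few fields, and the confidence label it assigns is visible in the same place.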

Out of scope (separate issues)

The framing

If CLDK is going to be the pandas of code analysis (the broader thesis from the discussion these issues came out of), this is the equivalent of pandas moving from "Series + DataFrame" to "DataFrame + MultiIndex + merge + groupby + cross-table joins." A single graph is a Series. The multi-graph substrate with shared node ids and cross-language bridges is the DataFrame join — which is where pandas went from useful to indispensable.


Context: same Odoo audit and poe-with-cldk skill methodology that motivated #155. Within one Python file, the call graph carried me most of the way; the next step up — auditing the seams between the Python controllers, the XML actions, the JS frontend, the ACL CSVs, and the SQL access patterns — runs into a wall today because there is no substrate that represents those relations as joinable typed graphs.
