Codegraph

The codegraph is your organization's code, modelled as a graph instead of a pile of files. When Unyform analyzes a repo, it doesn't just store text; it extracts every symbol and the relationships between them: who calls what, what implements which trait, which handler answers a route, which function in this repo reaches across to an endpoint in that one. The result is a typed, queryable graph that the planning agent, the gateway, and your dashboard all read from.

Why a graph

Grep finds strings. It can't tell you that charge_card is called by seven handlers, that three of them live in a different repository, or that your team has quietly standardized on TextField over four older input components. Those are relationships, and relationships are exactly what a graph stores natively.

Note

The audience for the codegraph is AI agents and governance, not humans typing code. It's not an IDE feature; it's the substrate the gateway queries at request time and the planning agent reads instead of guessing from file dumps.

What's in the graph

Every node and edge carries an org_id, so the graph is strictly per-tenant. Nodes are the things in your code; edges are how they relate.

Nodes | Edges
Repository, Directory, File, Module | Contains (structural nesting)
Function, Class, Trait, Struct, Enum, Interface, Variable | Calls, Imports, Defines
Route, Endpoint | HandlesRoute, CallsEndpoint
Pattern, Policy, Community | Inherits, Implements
Dependency, GapFinding | CrossRepoCalls (call routed through an endpoint into another repo)
Idiom | MatchesPattern, Violates, CompliesWith, HasConcern
Concern | FollowsIdiom, DriftsFromIdiom
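
To make the shape concrete, here is a minimal sketch of nodes and edges as plain records that each carry an org_id. The class names, field names, and the make_edge helper are assumptions made for illustration, not Unyform's actual schema; the cross-tenant check is one plausible way to express "strictly per-tenant".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    org_id: str  # tenant that owns this node
    kind: str    # e.g. "Function", "Route", "Idiom"
    name: str


@dataclass(frozen=True)
class Edge:
    org_id: str  # tenant that owns this edge
    kind: str    # e.g. "Calls", "HandlesRoute", "Violates"
    src: Node
    dst: Node


def make_edge(kind: str, src: Node, dst: Node) -> Edge:
    """Build an edge, refusing to link nodes that belong to different orgs."""
    if src.org_id != dst.org_id:
        raise ValueError("edges may not cross org boundaries")
    return Edge(org_id=src.org_id, kind=kind, src=src, dst=dst)


# Hypothetical symbols, used only for this example.
handler = Node("org_a", "Function", "checkout_handler")
charge = Node("org_a", "Function", "charge_card")
call = make_edge("Calls", handler, charge)
```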

A few edges do the heavy lifting:

  • CrossRepoCalls links a caller in one repo to the handler in another, matched at the AST level through a shared endpoint, with a confidence score (0.8 for a path+method match, 0.95 when an OpenAPI spec confirms it).
  • FollowsIdiom / DriftsFromIdiom mark whether a symbol uses the canonical artifact for its purpose or reaches for an older alternative. The drift edges double as a refactor backlog: the negative space where your codebase isn't consistent yet.
  • Violates / CompliesWith turn policies into graph queries: "does this code path violate an active policy?" becomes a single traversal.
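
The CrossRepoCalls scoring can be sketched roughly as follows. The two confidence values come from the description above; the record types, the link function, and the idea of passing OpenAPI-confirmed routes as a set of (method, path) pairs are assumptions for illustration, and the real matcher works on ASTs rather than pre-extracted records.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class EndpointCall:
    repo: str
    caller: str
    method: str
    path: str


@dataclass(frozen=True)
class RouteHandler:
    repo: str
    handler: str
    method: str
    path: str


@dataclass(frozen=True)
class CrossRepoCall:
    caller: str
    handler: str
    confidence: float


def link(call: EndpointCall, route: RouteHandler,
         spec_routes: Set[Tuple[str, str]]) -> Optional[CrossRepoCall]:
    """Return a CrossRepoCalls edge if the call and route share an endpoint."""
    if call.repo == route.repo:
        return None  # same repo, so this is not a cross-repo hop
    key = (route.method.upper(), route.path)
    if (call.method.upper(), call.path) != key:
        return None  # path or method differs
    # 0.95 when the (hypothetical) OpenAPI route set confirms the endpoint,
    # 0.8 for a bare path+method match.
    confidence = 0.95 if key in spec_routes else 0.8
    return CrossRepoCall(call.caller, route.handler, confidence)


outgoing = EndpointCall("web", "submit_order", "post", "/v1/charges")
incoming = RouteHandler("billing", "charge_card", "POST", "/v1/charges")
unconfirmed = link(outgoing, incoming, set())
confirmed = link(outgoing, incoming, {("POST", "/v1/charges")})
```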

How it's built

The codegraph is built at analysis time, when you connect a repo. The ingestion pipeline runs a fixed sequence of pure phases over the cloned source.

flowchart LR
    A[scan files] --> B[classify language]
    B --> C[tree-sitter parse]
    C --> D[extract symbols & routes]
    D --> E[resolve calls,<br/>imports, inheritance]
    E --> F[SCIP enrich]
    F --> G[match patterns,<br/>mine idioms, eval policies]
    G --> H[detect communities]
    H --> I[persist to Kuzu]
    I --> J[project to Postgres<br/>snapshot]

Two things make the extraction precise:

  1. Tree-sitter parses every supported file into an AST, so symbols come from real syntax rather than regex guesses.
  2. SCIP indexers overlay semantic precision where a language provides them (rust-analyzer, scip-typescript, gopls), so call and reference resolution is exact, not heuristic, for Rust, TypeScript, and Go.

Resolution happens in two passes: a pre-scan builds a symbol-to-file map, then calls, imports, and inheritance chains are bound against it. Cross-repo linking runs last, matching the endpoints one repo calls against the routes another repo handles.
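
The two-pass idea can be sketched in a few lines. This is a simplified illustration, not the pipeline's code: the function names and input shapes are assumptions, and it ignores scoping and name collisions, which a real resolver has to handle.

```python
from typing import Dict, List, Tuple


def build_symbol_index(files: Dict[str, List[str]]) -> Dict[str, str]:
    """Pass 1: map each defined symbol name to the file that defines it."""
    index: Dict[str, str] = {}
    for path, symbols in files.items():
        for symbol in symbols:
            # First definition wins; real resolution would disambiguate by scope.
            index.setdefault(symbol, path)
    return index


def resolve_calls(calls: List[Tuple[str, str]], index: Dict[str, str]):
    """Pass 2: bind each (caller_file, callee_name) pair against the index."""
    resolved, unresolved = [], []
    for caller_file, callee in calls:
        target = index.get(callee)
        if target is None:
            unresolved.append((caller_file, callee))
        else:
            resolved.append((caller_file, callee, target))
    return resolved, unresolved


# Hypothetical file and symbol names.
files = {
    "billing/charge.py": ["charge_card"],
    "web/checkout.py": ["checkout_handler"],
}
calls = [
    ("web/checkout.py", "charge_card"),
    ("web/checkout.py", "send_receipt"),
]
index = build_symbol_index(files)
resolved, unresolved = resolve_calls(calls, index)
```

Building the index first means a call can bind to a symbol defined in a file that is scanned later, which a single pass could not do.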

Tip

Supported languages: Rust, TypeScript/JavaScript, Python, Go, and Java. Languages with a SCIP indexer (Rust, TS/JS, Go) get the highest fidelity; the rest fall back to tree-sitter extraction, which is still AST-accurate.

Where it lives

The graph is stored in Kuzu, an embedded columnar graph database, with one physical .kuzu file per org, so tenant isolation is enforced by the filesystem, not just a WHERE org_id = ? clause. Each analysis produces a snapshot, which is then projected into Postgres for the read side. That split keeps the rich graph available for deep queries while the dashboard and gateway read fast, denormalized snapshots.

Warning

A snapshot reflects the repo at the SHA it was analyzed. Push new code and the graph is stale until the next analysis runs: queries answer "how things were at the last analysis," not your uncommitted working tree.

What it powers

The graph isn't a side artifact; it's the source of truth that several Unyform features resolve their content from.

Blueprints are resolved from the graph, not hand-written. Repo role, detected frameworks, dependencies and their purpose, and gap-analysis findings are all nodes the blueprint reads, so when the gateway injects a blueprint, it's injecting structured facts about your code, not a prose summary that drifts.

Idiom mining clusters symbols by name pattern, module group, or co-import set, then scores each cluster to find the canonical one: the symbol with the most incoming Calls/Imports edges. Clusters with ≥70% adoption and ≥3 uses become Idiom nodes: "the way this codebase does X," with the alternatives listed as the inconsistencies still to clean up.
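
The thresholds can be sketched as a small scoring function. The 70% and 3-use cutoffs come from the text above; everything else is an assumption made for illustration, in particular reading "adoption" as the canonical symbol's share of the cluster's incoming edges and "uses" as its incoming edge count. The symbol names other than TextField are hypothetical.

```python
from typing import Dict, Optional


def pick_idiom(cluster: Dict[str, int],
               min_adoption: float = 0.70,
               min_uses: int = 3) -> Optional[dict]:
    """cluster maps symbol name -> count of incoming Calls/Imports edges."""
    total = sum(cluster.values())
    if total == 0:
        return None
    # Canonical symbol: the one with the most incoming edges.
    canonical, uses = max(cluster.items(), key=lambda item: item[1])
    adoption = uses / total  # assumed definition of "adoption"
    if adoption < min_adoption or uses < min_uses:
        return None
    return {
        "canonical": canonical,
        "adoption": adoption,
        # Alternatives are the inconsistencies still to clean up.
        "alternatives": sorted(name for name in cluster if name != canonical),
    }


idiom = pick_idiom({"TextField": 16, "LegacyInput": 2, "PlainInput": 2})
too_few_uses = pick_idiom({"TextField": 2})
too_contested = pick_idiom({"TextField": 5, "LegacyInput": 5})
```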

Visualizations in the dashboard render straight off the graph:

  • Ecosystem: repo-level overview, colored by detected community
  • Call graph: function-level zoom into a single repo
  • Request flow: routes → handlers → callees, including cross-repo hops

MCP tools expose the graph to external agents over a stable surface, so an agent can ask structural questions instead of reading files:

  • codegraph_impact: what breaks if you change this symbol
  • codegraph_cross_repo_callers: who calls this across repo boundaries
  • codegraph_route_map: the route → handler topology of a service
  • codegraph_context: task-relevant retrieval for LLM planning (hybrid BM25 + graph traversal)
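
To show what a structural question looks like, here is a rough sketch of how an impact query could be answered: walk the Calls edges backwards and collect every transitive caller. This is an assumption about one reasonable approach, not the implementation behind codegraph_impact; the impacted_by function and the symbol names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def impacted_by(symbol: str, calls: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return every symbol that directly or transitively calls `symbol`.

    `calls` is an iterable of (caller, callee) pairs, one per Calls edge.
    """
    callers_of: Dict[str, Set[str]] = {}
    for caller, callee in calls:
        callers_of.setdefault(callee, set()).add(caller)

    seen: Set[str] = set()
    queue = deque([symbol])
    while queue:
        current = queue.popleft()
        for caller in callers_of.get(current, ()):
            # Skip already-visited callers so call cycles terminate.
            if caller != symbol and caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


call_edges = [
    ("checkout_handler", "charge_card"),
    ("retry_job", "charge_card"),
    ("http_router", "checkout_handler"),
    ("charge_card", "log_event"),
]
impact = impacted_by("charge_card", call_edges)
```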

How it ties into the gateway

At request time the gateway reads the same graph. When a chat request names a function or file, the gateway injects graph context โ€” callers, callees, relevant patterns โ€” as part of its governance injection, and turns policy enforcement into a Violates-edge lookup. The graph is what lets generated code match what already exists, because the gateway can see what already exists.
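
As a sketch of what a Violates-edge lookup might reduce to, the function below filters a symbol's outgoing edges down to Violates edges that point at active policies. The tuple layout, the function name, and the policy identifiers are all assumptions for illustration; the gateway's real query runs against the stored graph.

```python
from typing import Iterable, List, Set, Tuple


def violated_policies(symbol: str,
                      edges: Iterable[Tuple[str, str, str]],
                      active_policies: Set[str]) -> List[str]:
    """Return the active policies that `symbol` has a Violates edge to.

    `edges` is an iterable of (source, edge_kind, target) triples.
    """
    return sorted(
        target
        for source, kind, target in edges
        if source == symbol and kind == "Violates" and target in active_policies
    )


# Hypothetical edges and policy names.
policy_edges = [
    ("apply_discount", "Violates", "no-raw-sql"),
    ("apply_discount", "CompliesWith", "require-auth"),
    ("apply_discount", "Violates", "retired-policy"),
]
active = {"no-raw-sql", "require-auth"}
violations = violated_policies("apply_discount", policy_edges, active)
```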
