Codegraph

The codegraph is your organization's code, modelled as a graph instead of a pile of files. When Unyform analyzes a repo, it doesn't just store text — it extracts every symbol and the relationships between them: who calls what, what implements which trait, which handler answers a route, which function in this repo reaches across to an endpoint in that one. The result is a typed, queryable graph that the planning agent, the gateway, and your dashboard all read from.

Why a graph

Grep finds strings. It can't tell you that charge_card is called by seven handlers, that three of them live in a different repository, or that your team has quietly standardized on TextField over four older input components. Those are relationships, and relationships are exactly what a graph stores natively.

Note

The audience for the codegraph is AI agents and governance, not humans typing code. It's not an IDE feature — it's the substrate the gateway queries at request time and the planning agent reads instead of guessing from file dumps.

What's in the graph

Every node and edge carries an org_id, so the graph is strictly per-tenant. Nodes are the things in your code; edges are how they relate.

| Nodes | Edges |
| --- | --- |
| `Repository`, `Directory`, `File`, `Module` | `Contains` (structural nesting) |
| `Function`, `Class`, `Trait`, `Struct`, `Enum`, `Interface`, `Variable` | `Calls`, `Imports`, `Defines` |
| `Route`, `Endpoint` | `HandlesRoute`, `CallsEndpoint` |
| `Pattern`, `Policy`, `Community` | `Inherits`, `Implements` |
| `Dependency`, `GapFinding` | `CrossRepoCalls` (a call routed through an endpoint into another repo) |
| `Idiom` | `MatchesPattern`, `Violates`, `CompliesWith`, `HasConcern` |
| `Concern` | `FollowsIdiom`, `DriftsFromIdiom` |
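To make the per-tenant rule concrete, nodes and edges can be pictured as typed records that each carry an `org_id`. This is only a sketch: the class names, fields, and the `add_edge` check are assumptions for illustration, not Unyform's actual schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    org_id: str  # tenant that owns the node
    kind: str    # e.g. "Function", "Route", "Idiom"
    name: str


@dataclass(frozen=True)
class Edge:
    org_id: str  # tenant that owns the edge
    kind: str    # e.g. "Calls", "HandlesRoute"
    src: Node
    dst: Node


@dataclass
class Graph:
    org_id: str
    edges: list[Edge] = field(default_factory=list)

    def add_edge(self, kind: str, src: Node, dst: Node) -> Edge:
        # Refuse to link nodes that belong to a different tenant.
        if src.org_id != self.org_id or dst.org_id != self.org_id:
            raise ValueError("edge would cross a tenant boundary")
        edge = Edge(self.org_id, kind, src, dst)
        self.edges.append(edge)
        return edge


graph = Graph(org_id="org_a")
handler = Node("org_a", "Function", "checkout_handler")
charge = Node("org_a", "Function", "charge_card")
graph.add_edge("Calls", handler, charge)
```

Rejecting mixed-tenant edges at write time is one way to keep the invariant; the real system may enforce it elsewhere.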

A few edges do the heavy lifting:

- CrossRepoCalls links a caller in one repo to the handler in another, matched at the AST level through a shared endpoint, with a confidence score (0.8 for a path+method match, 0.95 when an OpenAPI spec confirms it).
- FollowsIdiom / DriftsFromIdiom mark whether a symbol uses the canonical artifact for its purpose or reaches for an older alternative. The drift edges double as a refactor backlog: the negative space where your codebase isn't consistent yet.
- Violates / CompliesWith turn policies into graph queries: "does this code path violate an active policy?" becomes a single traversal.
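The confidence rule for CrossRepoCalls can be sketched as a small function. The two scores (0.8 and 0.95) come from the description above; the record types, the function name, and the idea of passing OpenAPI operations as a set of `(method, path)` pairs are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndpointCall:
    repo: str
    caller: str
    method: str
    path: str


@dataclass(frozen=True)
class RouteHandler:
    repo: str
    handler: str
    method: str
    path: str


def cross_repo_confidence(
    call: EndpointCall,
    route: RouteHandler,
    openapi_operations: set[tuple[str, str]],
) -> Optional[float]:
    """Return a confidence score for a CrossRepoCalls edge, or None if no match."""
    if call.repo == route.repo:
        return None  # same repo: not a cross-repo call
    key = (call.method.upper(), call.path)
    if key != (route.method.upper(), route.path):
        return None  # path and method must both match
    # 0.95 when an OpenAPI spec lists the operation, 0.8 otherwise.
    return 0.95 if key in openapi_operations else 0.8


call = EndpointCall("web", "submit_order", "post", "/v1/charges")
route = RouteHandler("billing", "create_charge", "POST", "/v1/charges")
score = cross_repo_confidence(call, route, {("POST", "/v1/charges")})
```

This sketch compares paths as exact strings; matching parameterized paths such as `/v1/charges/{id}` would need extra normalization that is not shown here.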

How it's built

The codegraph is built at analysis time, when you connect a repo. The ingestion pipeline runs a fixed sequence of pure phases over the cloned source.

flowchart LR
    A[scan files] --> B[classify language]
    B --> C[tree-sitter parse]
    C --> D[extract symbols & routes]
    D --> E[resolve calls,<br/>imports, inheritance]
    E --> F[SCIP enrich]
    F --> G[match patterns,<br/>mine idioms, eval policies]
    G --> H[detect communities]
    H --> I[persist to Kuzu]
    I --> J[project to Postgres<br/>snapshot]
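One way to picture "a fixed sequence of pure phases" is a list of functions that each take the accumulated state and return a new one without mutating their input. The two phases below are stand-in stubs with hard-coded results, and `run_pipeline` is a hypothetical name; the real phases parse and analyze the cloned source.

```python
from functools import reduce
from typing import Callable

# A phase maps the accumulated state to a new state; it never mutates its input.
Phase = Callable[[dict], dict]


def scan_files(state: dict) -> dict:
    # Stub: a real scan would walk the cloned repository.
    return {**state, "files": ["src/main.rs", "README.md"]}


def classify_language(state: dict) -> dict:
    # Stub: classify by extension only, and only for Rust files.
    langs = {f: "rust" for f in state["files"] if f.endswith(".rs")}
    return {**state, "languages": langs}


def run_pipeline(phases: list[Phase], initial: dict) -> dict:
    """Apply each phase in order, threading the state through."""
    return reduce(lambda state, phase: phase(state), phases, initial)


initial = {"repo": "billing"}
result = run_pipeline([scan_files, classify_language], initial)
```

Because each phase returns a fresh dict, `initial` is left unchanged, which makes individual phases easy to test in isolation.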

Two things make the extraction precise:

- Tree-sitter parses every supported file into an AST, so symbols come from real syntax rather than regex guesses.
- SCIP indexers overlay semantic precision where a language provides them (rust-analyzer, scip-typescript, gopls), so call and reference resolution is exact, not heuristic, for Rust, TypeScript, and Go.

Resolution happens in two passes: a pre-scan builds a symbol-to-file map, then calls, imports, and inheritance chains are bound against it. Cross-repo linking runs last, matching the endpoints one repo calls against the routes another repo handles.
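The two passes can be sketched in a few lines. The input shape (`defines` and `calls` per file) and the function names are assumptions for illustration; real resolution also has to deal with scoping, imports, and name collisions, which this sketch ignores.

```python
# Hypothetical extraction output: per file, the symbols it defines and the
# names it calls.
extracted = {
    "billing/charge.py": {"defines": ["charge_card"], "calls": ["log_event"]},
    "billing/audit.py": {"defines": ["log_event"], "calls": []},
    "web/checkout.py": {"defines": ["checkout"], "calls": ["charge_card", "missing_fn"]},
}


def build_symbol_index(files: dict) -> dict[str, str]:
    """Pass 1: map every defined symbol to the file that defines it."""
    index = {}
    for path, info in files.items():
        for symbol in info["defines"]:
            index[symbol] = path
    return index


def bind_calls(files: dict, index: dict[str, str]):
    """Pass 2: bind each call site against the index built in pass 1."""
    edges, unresolved = [], []
    for path, info in files.items():
        for callee in info["calls"]:
            if callee in index:
                edges.append((path, callee, index[callee]))
            else:
                unresolved.append((path, callee))
    return edges, unresolved


index = build_symbol_index(extracted)
edges, unresolved = bind_calls(extracted, index)
```

Building the full index first means a call can be bound to a symbol defined in a file that is processed later, which a single pass could not do.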

Tip

Supported languages: Rust, TypeScript/JavaScript, Python, Go, and Java. Languages with a SCIP indexer (Rust, TS/JS, Go) get the highest fidelity; the rest fall back to tree-sitter extraction, which is still AST-accurate.

Where it lives

The graph is stored in Kuzu, an embedded columnar graph database — one physical .kuzu file per org, so tenant isolation is enforced by the filesystem, not just a WHERE org_id = ? clause. Each analysis produces a snapshot, which is then projected into Postgres for the read side. That split keeps the rich graph available for deep queries while the dashboard and gateway read fast, denormalized snapshots.
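With one file per org, choosing the database comes down to deriving a path from the `org_id`. The sketch below assumes a flat data directory and an `<org_id>.kuzu` naming scheme; both are illustrative guesses, as is the `graph_path` helper. It validates the id so that a crafted value cannot point at another tenant's file.

```python
import re
from pathlib import Path

# Assumed id format: letters, digits, underscore, hyphen.
_ORG_ID = re.compile(r"[A-Za-z0-9_-]+")


def graph_path(data_dir: Path, org_id: str) -> Path:
    """Return the per-org database file, rejecting ids that could escape data_dir."""
    if not _ORG_ID.fullmatch(org_id):
        raise ValueError(f"invalid org_id: {org_id!r}")
    return data_dir / f"{org_id}.kuzu"


path = graph_path(Path("graphs"), "org_a")
```

A request for org A then never opens a handle to org B's file, so a missing filter in a query cannot leak another tenant's data.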

Warning

A snapshot reflects the repo at the SHA it was analyzed. Push new code and the graph is stale until the next analysis runs — queries answer "how things were at the last analysis," not your uncommitted working tree.

What it powers

The graph isn't a side artifact — it's the source of truth that several Unyform features resolve their content from.

Blueprints are resolved from the graph, not hand-written. Repo role, detected frameworks, dependencies and their purpose, and gap-analysis findings are all nodes the blueprint reads — so when the gateway injects a blueprint, it's injecting structured facts about your code, not a prose summary that drifts.

Idiom mining clusters symbols by name pattern, module group, or co-import set, then scores the symbols in each cluster to find the canonical one: the symbol with the most incoming Calls/Imports edges. Clusters with ≥70% adoption and ≥3 uses become Idiom nodes: "the way this codebase does X," with the alternatives listed as the inconsistencies still to clean up.
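The selection and thresholds can be sketched for a single cluster. The thresholds are the ones stated above; reading "adoption" as the canonical symbol's share of the cluster's incoming edges, and "uses" as its own incoming-edge count, is an assumption of this sketch, as are the `Idiom` fields and the `mine_idiom` name.

```python
from dataclasses import dataclass
from typing import Optional

MIN_ADOPTION = 0.70  # from the text: at least 70% adoption
MIN_USES = 3         # from the text: at least 3 uses


@dataclass(frozen=True)
class Idiom:
    canonical: str
    adoption: float
    alternatives: tuple[str, ...]


def mine_idiom(incoming_edges: dict[str, int]) -> Optional[Idiom]:
    """Pick the canonical symbol of one cluster and apply both thresholds.

    incoming_edges maps each symbol in the cluster to its count of incoming
    Calls/Imports edges. Returns None when the cluster does not qualify.
    """
    total = sum(incoming_edges.values())
    if total == 0:
        return None
    canonical = max(incoming_edges, key=incoming_edges.get)
    uses = incoming_edges[canonical]
    adoption = uses / total
    if adoption < MIN_ADOPTION or uses < MIN_USES:
        return None
    alternatives = tuple(sorted(s for s in incoming_edges if s != canonical))
    return Idiom(canonical, adoption, alternatives)


idiom = mine_idiom({"TextField": 42, "LegacyInput": 5, "FormInput": 3})
```

Here `TextField` receives 42 of 50 incoming edges, so it qualifies, and the two remaining components are reported as alternatives. A cluster split evenly between two symbols would return `None`.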

Visualizations in the dashboard render straight off the graph:

- Ecosystem: repo-level overview, colored by detected community
- Call graph: function-level zoom into a single repo
- Request flow: routes → handlers → callees, including cross-repo hops

MCP tools expose the graph to external agents over a stable surface, so an agent can ask structural questions instead of reading files:

- codegraph_impact: what breaks if you change this symbol
- codegraph_cross_repo_callers: who calls this across repo boundaries
- codegraph_route_map: the route → handler topology of a service
- codegraph_context: task-relevant retrieval for LLM planning (hybrid BM25 + graph traversal)
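To illustrate the kind of question `codegraph_impact` answers, the sketch below walks Calls edges in reverse to collect every direct and transitive caller of a symbol. How the tool is actually implemented is not described here; the reverse breadth-first search, the edge list, and the `impacted_by` name are assumptions.

```python
from collections import defaultdict, deque

# Hypothetical Calls edges as (caller, callee) pairs.
calls = [
    ("checkout_handler", "charge_card"),
    ("retry_payment", "charge_card"),
    ("api_router", "checkout_handler"),
    ("charge_card", "log_event"),
]


def impacted_by(symbol: str, call_edges) -> list[str]:
    """Return every symbol that directly or transitively calls `symbol`."""
    callers = defaultdict(list)
    for caller, callee in call_edges:
        callers[callee].append(caller)

    seen, queue = set(), deque([symbol])
    while queue:
        current = queue.popleft()
        for caller in callers[current]:
            # Skip already-visited callers so cycles terminate.
            if caller not in seen and caller != symbol:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


affected = impacted_by("charge_card", calls)
```

Changing `charge_card` touches its two direct callers and, through `checkout_handler`, `api_router`. A text search for the name would find only the direct call sites.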

How it ties into the gateway

At request time the gateway reads the same graph. When a chat request names a function or file, the gateway injects graph context — callers, callees, relevant patterns — as part of its governance injection, and turns policy enforcement into a Violates-edge lookup. The graph is what lets generated code match what already exists, because the gateway can see what already exists.
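A rough sketch of that lookup: given the symbol a request names, collect its callers, callees, and any policies it has a Violates edge to. The edge list, the policy id, and the shape of the returned dict are invented for illustration; the gateway's real injection format is not specified here.

```python
# Hypothetical edge list for one org: (kind, source, target).
edges = [
    ("Calls", "checkout_handler", "charge_card"),
    ("Calls", "charge_card", "log_event"),
    ("Violates", "charge_card", "policy:no-raw-card-logging"),
]


def graph_context(symbol: str, edge_list) -> dict:
    """Collect callers, callees, and violated policies for one symbol."""
    return {
        "symbol": symbol,
        "callers": sorted(s for k, s, t in edge_list if k == "Calls" and t == symbol),
        "callees": sorted(t for k, s, t in edge_list if k == "Calls" and s == symbol),
        "violations": sorted(
            t for k, s, t in edge_list if k == "Violates" and s == symbol
        ),
    }


context = graph_context("charge_card", edges)
```

The policy check is a filter over existing edges rather than a fresh analysis of the code, which is what makes it cheap enough to run on every request.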
