Codegraph
The codegraph is your organization's code, modelled as a graph instead of a pile of files. When Unyform analyzes a repo, it doesn't just store text; it extracts every symbol and the relationships between them: who calls what, what implements which trait, which handler answers a route, which function in this repo reaches across to an endpoint in that one. The result is a typed, queryable graph that the planning agent, the gateway, and your dashboard all read from.
Why a graph
Grep finds strings. It can't tell you that charge_card is called by seven
handlers, that three of them live in a different repository, or that your team
has quietly standardized on TextField over four older input components. Those
are relationships, and relationships are exactly what a graph stores natively.
Note
The audience for the codegraph is AI agents and governance, not humans typing code. It's not an IDE feature; it's the substrate the gateway queries at request time and the planning agent reads instead of guessing from file dumps.
What's in the graph
Every node and edge carries an org_id, so the graph is strictly per-tenant.
Nodes are the things in your code; edges are how they relate.
| Nodes | Edges |
|---|---|
| Repository, Directory, File, Module | Contains (structural nesting) |
| Function, Class, Trait, Struct, Enum, Interface, Variable | Calls, Imports, Defines |
| Route, Endpoint | HandlesRoute, CallsEndpoint |
| Pattern, Policy, Community | Inherits, Implements |
| Dependency, GapFinding | CrossRepoCalls (a call routed through an endpoint into another repo) |
| Idiom | MatchesPattern, Violates, CompliesWith, HasConcern |
| Concern | FollowsIdiom, DriftsFromIdiom |
A few edges do the heavy lifting:
- CrossRepoCalls links a caller in one repo to the handler in another, matched at the AST level through a shared endpoint, with a confidence score (0.8 for a path+method match, 0.95 when an OpenAPI spec confirms it).
- FollowsIdiom / DriftsFromIdiom mark whether a symbol uses the canonical artifact for its purpose or reaches for an older alternative. The drift edges double as a refactor backlog: the negative space where your codebase isn't consistent yet.
- Violates / CompliesWith turn policies into graph queries: "does this code path violate an active policy?" becomes a single traversal.
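To make the "single traversal" idea concrete, here is a minimal in-memory sketch. It assumes a hypothetical `Edge` record and a hypothetical `violated_policies` helper; it is not Unyform's storage model or query API, only an illustration of how typed, per-tenant edges let a policy check become a graph walk.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; every edge carries its tenant's org_id."""
    org_id: str
    kind: str  # e.g. "Calls" or "Violates"
    src: str
    dst: str


def violated_policies(edges, org_id, start):
    """Follow Calls edges from `start` and collect every policy reached
    through a Violates edge, looking only at edges owned by `org_id`."""
    calls, violates = {}, {}
    for e in edges:
        if e.org_id != org_id:
            continue  # never read another tenant's edges
        if e.kind == "Calls":
            calls.setdefault(e.src, []).append(e.dst)
        elif e.kind == "Violates":
            violates.setdefault(e.src, []).append(e.dst)

    seen, queue, found = {start}, deque([start]), set()
    while queue:
        node = queue.popleft()
        found.update(violates.get(node, []))
        for callee in calls.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return found


# Illustrative data only; the symbol and policy names are made up.
EDGES = [
    Edge("acme", "Calls", "checkout_handler", "charge_card"),
    Edge("acme", "Calls", "charge_card", "log_raw_pan"),
    Edge("acme", "Violates", "log_raw_pan", "policy:no-pii-in-logs"),
    Edge("other", "Violates", "checkout_handler", "policy:other-tenant"),
]

print(violated_policies(EDGES, "acme", "checkout_handler"))
```

The edge owned by the `other` org is skipped entirely, which mirrors the per-tenant rule: a traversal for one org never sees another org's nodes or edges.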
How it's built
The codegraph is built at analysis time, when you connect a repo. The ingestion pipeline runs a fixed sequence of pure phases over the cloned source.
flowchart LR
A[scan files] --> B[classify language]
B --> C[tree-sitter parse]
C --> D[extract symbols & routes]
D --> E[resolve calls,<br/>imports, inheritance]
E --> F[SCIP enrich]
F --> G[match patterns,<br/>mine idioms, eval policies]
G --> H[detect communities]
H --> I[persist to Kuzu]
I --> J[project to Postgres<br/>snapshot]
Two things make the extraction precise:
- Tree-sitter parses every supported file into an AST, so symbols come from real syntax rather than regex guesses.
- SCIP indexers (rust-analyzer, scip-typescript, gopls) overlay semantic precision where a language provides them, so call and reference resolution is exact, not heuristic, for Rust, TypeScript, and Go.
Resolution happens in two passes: a pre-scan builds a symbol-to-file map, then calls, imports, and inheritance chains are bound against it. Cross-repo linking runs last, matching the endpoints one repo calls against the routes another repo handles.
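The two passes and the final cross-repo step can be sketched as follows. This is an illustration under stated assumptions, not the actual pipeline: the input shapes, `resolve_calls`, `link_cross_repo`, and the `openapi_confirmed` set are all hypothetical. Only the two confidence values (0.8 and 0.95) come from the description of CrossRepoCalls.

```python
from dataclasses import dataclass


def resolve_calls(files):
    """files: {path: {"defines": [symbol], "calls": [(caller, callee)]}}"""
    # Pass 1: pre-scan every file to build the symbol-to-file map.
    symbol_to_file = {}
    for path, info in files.items():
        for name in info["defines"]:
            symbol_to_file[name] = path
    # Pass 2: bind each call site against that map; unknown callees are skipped.
    resolved = []
    for info in files.values():
        for caller, callee in info["calls"]:
            target = symbol_to_file.get(callee)
            if target is not None:
                resolved.append((caller, callee, target))
    return resolved


@dataclass(frozen=True)
class EndpointCall:
    repo: str
    caller: str
    method: str
    path: str


@dataclass(frozen=True)
class Route:
    repo: str
    handler: str
    method: str
    path: str


def link_cross_repo(calls, routes, openapi_confirmed=frozenset()):
    """Match endpoints one repo calls against routes another repo handles."""
    by_key = {}
    for r in routes:
        by_key.setdefault((r.method.upper(), r.path), []).append(r)
    links = []
    for c in calls:
        key = (c.method.upper(), c.path)
        for r in by_key.get(key, []):
            if r.repo == c.repo:
                continue  # same repo, so not a cross-repo edge
            confidence = 0.95 if key in openapi_confirmed else 0.8
            links.append((c.caller, r.handler, confidence))
    return links


# Illustrative inputs; repo, file, and symbol names are made up.
FILES = {
    "billing/charge.py": {"defines": ["charge_card"], "calls": []},
    "billing/api.py": {"defines": ["checkout"], "calls": [("checkout", "charge_card")]},
}
CALLS = [EndpointCall("web", "submit_order", "post", "/v1/charges")]
ROUTES = [
    Route("billing", "checkout", "POST", "/v1/charges"),
    Route("web", "local_stub", "POST", "/v1/charges"),
]

print(resolve_calls(FILES))
print(link_cross_repo(CALLS, ROUTES))
print(link_cross_repo(CALLS, ROUTES, openapi_confirmed={("POST", "/v1/charges")}))
```

Building the symbol map before binding anything is what lets a call in one file resolve to a definition in a file that is scanned later.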
Tip
Supported languages: Rust, TypeScript/JavaScript, Python, Go, and Java. Languages with a SCIP indexer (Rust, TS/JS, Go) get the highest fidelity; the rest fall back to tree-sitter extraction, which is still AST-accurate.
Where it lives
The graph is stored in Kuzu, an embedded columnar graph database, with one
physical .kuzu file per org, so tenant isolation is enforced by the
filesystem, not just a WHERE org_id = ? clause. Each analysis produces a
snapshot, which is then projected into Postgres for the read side. That
split keeps the rich graph available for deep queries while the dashboard and
gateway read fast, denormalized snapshots.
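A minimal sketch of the one-file-per-org layout, assuming a hypothetical storage root and an `<org_id>.kuzu` naming scheme (neither is confirmed here). The point is that the tenant boundary is decided when the path is built, before any query runs.

```python
import re
from pathlib import Path

# Assumption: org ids are simple slugs. Rejecting anything else keeps an
# id like "../other-org" from escaping the storage root.
_ORG_ID = re.compile(r"[A-Za-z0-9_-]+")


def org_graph_path(root, org_id):
    """Return the path of the graph file that belongs to exactly one org."""
    if not _ORG_ID.fullmatch(org_id):
        raise ValueError(f"invalid org_id: {org_id!r}")
    return Path(root) / f"{org_id}.kuzu"


print(org_graph_path("/data/graphs", "acme"))
```

A process that only ever opens the path for the requesting org cannot return another tenant's rows, even if a query forgets its `org_id` filter.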
Warning
A snapshot reflects the repo at the SHA it was analyzed at. Push new code and the graph is stale until the next analysis runs: queries answer "how things were at the last analysis," not your uncommitted working tree.
What it powers
The graph isn't a side artifact; it's the source of truth that several Unyform features resolve their content from.
Blueprints are resolved from the graph, not hand-written. Repo role, detected frameworks, dependencies and their purpose, and gap-analysis findings are all nodes the blueprint reads, so when the gateway injects a blueprint, it's injecting structured facts about your code, not a prose summary that drifts.
Idiom mining clusters symbols by name pattern, module group, or co-import
set, then scores each cluster to find the canonical one: the symbol with the
most incoming Calls/Imports edges. Clusters with ≥70% adoption and ≥3 uses
become Idiom nodes: "the way this codebase does X," with the alternatives
listed as the inconsistencies still to clean up.
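The scoring rule can be sketched for a single cluster as below. Two things are assumptions rather than documented behavior: "adoption" is read as the canonical symbol's share of all incoming Calls/Imports edges in the cluster, and "uses" as the canonical symbol's own incoming edge count. The function name `mine_idiom` and the sample counts are hypothetical.

```python
def mine_idiom(cluster_usage, min_adoption=0.70, min_uses=3):
    """cluster_usage maps each symbol in one cluster to its number of
    incoming Calls/Imports edges. Returns an idiom record, or None if the
    cluster does not clear both thresholds."""
    total = sum(cluster_usage.values())
    if total == 0:
        return None  # nothing uses any symbol in this cluster
    # The canonical symbol is the one with the most incoming edges.
    canonical = max(cluster_usage, key=cluster_usage.get)
    uses = cluster_usage[canonical]
    adoption = uses / total  # assumed definition of "adoption"
    if adoption < min_adoption or uses < min_uses:
        return None
    return {
        "canonical": canonical,
        "adoption": adoption,
        # The remaining symbols are the drift still to clean up.
        "alternatives": sorted(s for s in cluster_usage if s != canonical),
    }


# Illustrative counts for a cluster of input components.
inputs = {"TextField": 16, "LegacyInput": 3, "OldTextBox": 1}
print(mine_idiom(inputs))
```

With these counts TextField holds 16 of 20 uses, so the cluster qualifies, and the two older components are listed as the alternatives that DriftsFromIdiom edges would point away from.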
Visualizations in the dashboard render straight off the graph:
- Ecosystem: repo-level overview, colored by detected community
- Call graph: function-level zoom into a single repo
- Request flow: routes → handlers → callees, including cross-repo hops
MCP tools expose the graph to external agents over a stable surface, so an agent can ask structural questions instead of reading files:
- codegraph_impact: what breaks if you change this symbol
- codegraph_cross_repo_callers: who calls this across repo boundaries
- codegraph_route_map: the route → handler topology of a service
- codegraph_context: task-relevant retrieval for LLM planning (hybrid BM25 + graph traversal)
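As an illustration of the kind of structural question these tools answer, an impact query can be modelled as a breadth-first walk over reversed Calls edges. This is a sketch of the general technique, not the implementation behind codegraph_impact; the `impacted_symbols` helper and the edge list are hypothetical.

```python
from collections import deque


def impacted_symbols(call_edges, changed):
    """call_edges: iterable of (caller, callee) pairs.
    Returns every symbol that directly or transitively calls `changed`."""
    # Reverse index: callee -> set of its direct callers.
    callers = {}
    for src, dst in call_edges:
        callers.setdefault(dst, set()).add(src)

    impacted, queue = set(), deque([changed])
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, ()):
            if caller != changed and caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


# Illustrative call graph; symbol names are made up.
CALL_EDGES = [
    ("checkout_handler", "charge_card"),
    ("refund_handler", "charge_card"),
    ("api_router", "checkout_handler"),
    ("charge_card", "audit_log"),
]

print(sorted(impacted_symbols(CALL_EDGES, "charge_card")))
```

An agent asking this question gets a bounded set of symbols back instead of reading every file that mentions the name.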
How it ties into the gateway
At request time the gateway reads the same graph. When
a chat request names a function or file, the gateway injects graph context
(callers, callees, relevant patterns) as part of its governance injection, and
turns policy enforcement into a Violates-edge lookup. The graph is what lets
generated code match what already exists, because the gateway can see what
already exists.
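A rough sketch of that lookup, assuming the gateway already has Calls edges available as (caller, callee) pairs. The `graph_context` helper, the token-matching approach, and the sample data are hypothetical; they only illustrate how naming a known symbol in a request can pull its neighbors into the prompt.

```python
import re


def graph_context(prompt, call_edges):
    """Find graph symbols named in `prompt` and return their callers and
    callees, in the order the symbols first appear in the prompt."""
    known = {symbol for edge in call_edges for symbol in edge}
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", prompt)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    named = [t for t in dict.fromkeys(tokens) if t in known]
    return {
        name: {
            "callers": sorted(src for src, dst in call_edges if dst == name),
            "callees": sorted(dst for src, dst in call_edges if src == name),
        }
        for name in named
    }


# Illustrative call graph; symbol names are made up.
GATEWAY_EDGES = [
    ("checkout_handler", "charge_card"),
    ("refund_handler", "charge_card"),
    ("charge_card", "audit_log"),
]

print(graph_context("Add retry logic to charge_card", GATEWAY_EDGES))
```

Words in the request that are not symbols in the graph are ignored, so only code the org actually has ends up in the injected context.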