Analysis pipeline
Analysis is the job that turns a repository into the intelligence Unyform governs with. It clones the repo at a commit, builds a codegraph, mines idioms, resolves a versioned blueprint, and writes embeddings for retrieval, all in one pass. Nothing about your code shapes a request until it has been through this pipeline.
You start it from Dashboard → Repositories by clicking Analyze on a connected GitHub repo. Everything after that is automatic.
When it runs
Analysis is a background job, not a request you wait on. The Analyze button enqueues a job; a queue worker picks it up and reports progress as it goes, so you can navigate away and come back. It runs:
- the first time you analyze a connected repository,
- whenever you click Analyze / Regenerate again (e.g. after merging meaningful changes),
- and never as part of a chat request: the gateway only reads analysis output at request time; it never produces it.
Note
A rapid double-click won't fan out into conflicting runs. The enqueue path cancels prior pending jobs for the same blueprint, and when a worker picks up a job it checks for a newer queued analyze of the same blueprint and skips itself if one exists.
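The two guards in the note above can be sketched as follows. This is a minimal in-memory illustration, not the actual queue implementation: `Job`, `AnalyzeQueue`, the status names, and the use of an increasing `id` to mean "newer" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count


# Hypothetical in-memory stand-in for the job queue; the real storage,
# field names, and status values are not described in this document.
@dataclass
class Job:
    id: int
    blueprint_id: str
    status: str = "pending"  # pending | running | cancelled | skipped


@dataclass
class AnalyzeQueue:
    jobs: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def enqueue(self, blueprint_id: str) -> Job:
        # Guard 1: cancel prior pending jobs for the same blueprint.
        for job in self.jobs:
            if job.blueprint_id == blueprint_id and job.status == "pending":
                job.status = "cancelled"
        job = Job(id=next(self._ids), blueprint_id=blueprint_id)
        self.jobs.append(job)
        return job

    def pick_up(self, job: Job) -> bool:
        """Return True if the worker should run this job."""
        if job.status != "pending":
            return False  # already cancelled, skipped, or running
        # Guard 2: skip if a newer analyze of the same blueprint is queued.
        newer_exists = any(
            other.blueprint_id == job.blueprint_id
            and other.status == "pending"
            and other.id > job.id
            for other in self.jobs
        )
        job.status = "skipped" if newer_exists else "running"
        return not newer_exists
```

With both guards in place, a double-click leaves exactly one runnable job: the first is cancelled at enqueue time, and even if a stale job reaches a worker, it steps aside for the newer one.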
The phases, end to end
The handler emits a fixed sequence of named phases. These are exactly the phases the dashboard renders while a run is in flight, so the diagram below is also the progress bar you watch.
flowchart TD
A["decrypt: Verifying Access"] --> B["tree: Scanning Repository"]
B --> C["filter: Filtering Files"]
C --> D["fetch: Downloading Content"]
D --> E["parse: Analyzing Code"]
E --> F["codegraph: Building Codegraph"]
F --> G["git: Saving to Storage"]
G --> H["store: Persisting Version"]
H --> I["ecosystem: Building Ecosystem"]
I --> J["relationships: Mapping Dependencies"]
J --> K["gateway: Attaching Gateway"]
K --> L["policies: Evaluating Policies"]
What each phase produces and where it lands
| Phase | What happens | Where the output lands |
|---|---|---|
| decrypt | Decrypts the stored GitHub connection token and re-resolves the repo's live default branch | (in-memory; nothing persisted) |
| tree | Fetches the full repository file tree from the provider | (in-memory) |
| filter | Applies smart-scan rules to pick indexable source files | (in-memory) |
| fetch | Downloads the selected file contents and config files, with per-file progress | (in-memory) |
| parse | Parses sources into chunks and generates the BlueprintOutput (architecture, patterns, dependencies, API conventions) | the blueprint content blob |
| codegraph | Builds the codegraph (symbols, calls, routes, idioms) and writes code + blueprint embeddings | Kuzu graph + Postgres snapshot; per-org vector store |
| git | Writes the generated blueprint files to git-backed storage | the org's blueprint git repo |
| store | Persists a new immutable blueprint_versions row pinned to the commit SHA | blueprint_versions (Postgres) |
| ecosystem | Upserts the repo into org_repositories with its inferred role | ecosystem tables |
| relationships | Detects and records cross-repo relationships | repo_relationships |
| gateway | Attaches the blueprint to the org's default gateway | gateway config |
| policies | Evaluates org-wide policies against the analysis results | policy results / audit |
By the end of a run you have four durable artifacts:
- a codegraph: stored in Kuzu, projected to a Postgres snapshot for fast reads (see Codegraph → Where it lives);
- mined idioms: the canonical way your codebase does each thing, resolved from the graph;
- a versioned blueprint: content-addressed and pinned to the git commit it was analyzed from (see Blueprints → Versioning);
- embeddings: one coarse blueprint vector plus a fine-grained vector per codegraph function, in the per-org vector store, for semantic retrieval.
Tip
The parse phase covers everything from tree-sitter parsing through
analysis, and codegraph covers the graph build and the embeddings write,
which on large or PHP-heavy repos can run for minutes. Splitting codegraph
out from parse is deliberate: it's why the dashboard shows honest progress
instead of sitting on "Analyzing Code".
Re-analysis: what's incremental, what's rebuilt
Analysis is repeatable and idempotent: you re-run it whenever code changes, and re-analyzing the same commit is cheap.
The codegraph and the blueprint are rebuilt from scratch each run. A re-analysis picks up new commits, so the graph and blueprint always reflect the SHA you analyzed, never a stale working tree. Each run produces a new, immutable blueprint version; the full history is retained, and the blueprint pointer moves to the latest version.
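The versioning behavior described above (immutable versions, retained history, a pointer that moves to the latest) can be pictured with a small sketch. The `BlueprintVersion` and `BlueprintHistory` names, the `number` field, and the choice of SHA-256 for the content address are assumptions for illustration; the real `blueprint_versions` schema is not shown in this document.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical model of a version row; frozen=True makes it immutable.
@dataclass(frozen=True)
class BlueprintVersion:
    number: int
    commit_sha: str
    content_hash: str  # assumed: SHA-256 of the blueprint content


@dataclass
class BlueprintHistory:
    versions: list = field(default_factory=list)
    latest: Optional[BlueprintVersion] = None

    def record_run(self, commit_sha: str, content: str) -> BlueprintVersion:
        # Every run appends a new version; earlier versions are never edited.
        version = BlueprintVersion(
            number=len(self.versions) + 1,
            commit_sha=commit_sha,
            content_hash=hashlib.sha256(content.encode("utf-8")).hexdigest(),
        )
        self.versions.append(version)
        self.latest = version  # the pointer moves to the newest version
        return version
```

Older versions stay addressable by number or commit SHA, which is what makes it possible to reason about exactly which snapshot a given request was served from.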
Embeddings dedup by content_hash. Each embedded item (the blueprint
summary, each function) is keyed by a SHA-256 of its embedded text. Before
calling the embedding provider, the pipeline loads the stored hashes and keeps
only items whose hash is new or changed. Unchanged functions are skipped
entirely (no embedding call, no write), so re-analyzing a repo where most code
is untouched costs only a handful of round-trips.
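The dedup step can be sketched like this. Only the SHA-256 keying comes from the text above; the function names, the item-id format, and the shape of the stored-hash mapping are assumptions for the example.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the text that would be embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_items_to_embed(
    items: dict[str, str], stored_hashes: dict[str, str]
) -> dict[str, str]:
    """Keep only items whose hash is new or changed.

    items:         {item_id: text to embed} for the current run
    stored_hashes: {item_id: content_hash} loaded from the vector store
    (both shapes are hypothetical)
    """
    return {
        item_id: text
        for item_id, text in items.items()
        if stored_hashes.get(item_id) != content_hash(text)
    }
```

Only the returned items would be sent to the embedding provider, so the cost of a re-run scales with how much code changed rather than with the size of the repo.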
Note
The embeddings write is opt-in and non-fatal. With no
OPENAI_API_KEY configured it no-ops, and any failure is logged but never fails
the analysis: the codegraph, blueprint, and version still persist.
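One way to picture the opt-in, non-fatal behavior is a wrapper like the one below. It is a sketch under assumptions: `write_embeddings_safely` and the `embed_and_store` callback are hypothetical names, and only the `OPENAI_API_KEY` variable comes from the note above.

```python
import logging
import os
from typing import Callable, Optional

logger = logging.getLogger("analysis")


def write_embeddings_safely(
    embed_and_store: Callable[[], int], api_key: Optional[str] = None
) -> int:
    """Run the embeddings write and return the number of items written.

    Never raises: a missing key is a no-op and a failure is only logged.
    """
    key = api_key if api_key is not None else os.environ.get("OPENAI_API_KEY")
    if not key:
        return 0  # opt-in: no key configured, so skip the write
    try:
        return embed_and_store()
    except Exception:
        # Non-fatal: log and let the rest of the analysis continue.
        logger.exception("embeddings write failed; analysis continues")
        return 0
```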
Warning
A blueprint version is a snapshot at the analyzed commit. Push new code and it's stale until the next analysis runs. Re-run Analyze after meaningful changes to keep injected context current.
How analysis output feeds the gateway
Analysis and serving are decoupled by design. The pipeline writes; the gateway reads. On each request the gateway:
- loads the attached blueprint and injects it as cached system context, so the model sees your architecture and conventions before the prompt;
- runs semantic retrieval over the per-org embeddings to surface the most relevant existing functions, so generated code matches what's already there;
- resolves graph context (callers, callees, relevant idioms) and turns policy enforcement into a graph lookup.
Because the request path only reads pre-computed artifacts, injection stays fast: the heavy lifting already happened in the background job.
Warning
Analysis alone changes nothing about requests until the blueprint is attached to a gateway. The pipeline attaches to the org's default gateway automatically, but a blueprint with no gateway attachment is just stored intelligence.
Watching progress in the dashboard
While a run is in flight, the Repositories view streams the live phase. Each
phase shows as upcoming, active, or completed, with the item count
and elapsed time for finished phases, so a long codegraph phase reads as
honest work rather than a hang.
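Because the phase sequence is fixed, each phase's display state follows from the position of the phase currently running. The sketch below illustrates that idea; the phase keys and labels are the ones listed on this page, while `phase_statuses` and its return shape are assumptions rather than the dashboard's actual code.

```python
# Phase keys and labels as listed in the pipeline diagram on this page.
PHASES = [
    ("decrypt", "Verifying Access"),
    ("tree", "Scanning Repository"),
    ("filter", "Filtering Files"),
    ("fetch", "Downloading Content"),
    ("parse", "Analyzing Code"),
    ("codegraph", "Building Codegraph"),
    ("git", "Saving to Storage"),
    ("store", "Persisting Version"),
    ("ecosystem", "Building Ecosystem"),
    ("relationships", "Mapping Dependencies"),
    ("gateway", "Attaching Gateway"),
    ("policies", "Evaluating Policies"),
]


def phase_statuses(active_phase: str) -> list[tuple[str, str]]:
    """Return (label, status) for every phase, given the active phase key."""
    keys = [key for key, _ in PHASES]
    active_index = keys.index(active_phase)  # ValueError on an unknown key
    statuses = []
    for index, (_, label) in enumerate(PHASES):
        if index < active_index:
            status = "completed"
        elif index == active_index:
            status = "active"
        else:
            status = "upcoming"
        statuses.append((label, status))
    return statuses
```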
If a job ever stalls, the system self-heals: a sweeper recovers jobs stuck in
the running state back to pending for retry, and a terminal sweeper marks
ancient stuck jobs as failed rather than leaving them spinning forever. In
practice you click Analyze, watch the phases tick through to Evaluating
Policies, and end with a fresh, commit-pinned blueprint attached and ready.
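The self-healing behavior described above can be sketched as two passes over running jobs. The thresholds, the `TrackedJob` type, and the function names are assumptions for illustration; this page does not state the actual timeouts or how the sweepers are scheduled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed thresholds; the real values are not given in this document.
STALL_AFTER = timedelta(minutes=15)
GIVE_UP_AFTER = timedelta(hours=24)


@dataclass
class TrackedJob:
    id: int
    status: str  # pending | running | failed | done (hypothetical names)
    started_at: datetime


def terminal_sweep(jobs: list, now: datetime) -> None:
    """Mark ancient stuck jobs as failed instead of retrying them forever."""
    for job in jobs:
        if job.status == "running" and now - job.started_at >= GIVE_UP_AFTER:
            job.status = "failed"


def recovery_sweep(jobs: list, now: datetime) -> None:
    """Return jobs stuck in the running state to pending for retry."""
    for job in jobs:
        if job.status == "running" and now - job.started_at >= STALL_AFTER:
            job.status = "pending"


def sweep(jobs: list, now: datetime) -> None:
    # Run the terminal pass first so ancient jobs fail rather than retry.
    terminal_sweep(jobs, now)
    recovery_sweep(jobs, now)
```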