Analysis pipeline

Analysis is the job that turns a repository into the intelligence Unyform governs with. It clones the repo at a commit, builds a codegraph, mines idioms, resolves a versioned blueprint, and writes embeddings for retrieval, all in one pass. Nothing about your code shapes a request until it has been through this pipeline.

You start it from Dashboard → Repositories by clicking Analyze on a connected GitHub repo. Everything after that is automatic.

When it runs

Analysis is a background job, not a request you wait on. The Analyze button enqueues a job; a queue worker picks it up and reports progress as it goes, so you can navigate away and come back. It runs:

  • the first time you analyze a connected repository,
  • whenever you click Analyze / Regenerate again (e.g. after merging meaningful changes),
  • and never as part of a chat request: the gateway only reads analysis output at request time; it never produces it.

Note

A rapid double-click won't fan out into conflicting runs. The enqueue path cancels prior pending jobs for the same blueprint, and when a worker picks up a job it checks for a newer queued analyze of the same blueprint and skips itself if one exists.
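The two guards described in this note can be sketched as follows. This is a minimal in-memory illustration, not the actual queue implementation: the `Job` and `AnalyzeQueue` names, the status strings, and the use of an increasing `id` to decide which job is "newer" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical: a monotonically increasing job id


@dataclass
class Job:
    blueprint_id: str
    status: str = "pending"  # pending | running | cancelled | skipped
    id: int = field(default_factory=lambda: next(_ids))


class AnalyzeQueue:
    """Hypothetical in-memory stand-in for the real job queue."""

    def __init__(self):
        self.jobs = []

    def enqueue(self, blueprint_id: str) -> Job:
        # Guard 1: cancel prior pending jobs for the same blueprint.
        for job in self.jobs:
            if job.blueprint_id == blueprint_id and job.status == "pending":
                job.status = "cancelled"
        job = Job(blueprint_id)
        self.jobs.append(job)
        return job

    def pick_up(self, job: Job) -> bool:
        # Guard 2: skip this job if a newer pending analyze exists
        # for the same blueprint. Returns True if the job should run.
        newer_exists = any(
            other.blueprint_id == job.blueprint_id
            and other.status == "pending"
            and other.id > job.id
            for other in self.jobs
        )
        if newer_exists:
            job.status = "skipped"
            return False
        job.status = "running"
        return True
```

With both guards in place, whichever job a worker claims first, only the most recent analyze for a blueprint actually runs.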

The phases, end to end

The handler emits a fixed sequence of named phases. These are exactly the phases the dashboard renders while a run is in flight, so the diagram below is also the progress bar you watch.

flowchart TD
    A["decrypt: Verifying Access"] --> B["tree: Scanning Repository"]
    B --> C["filter: Filtering Files"]
    C --> D["fetch: Downloading Content"]
    D --> E["parse: Analyzing Code"]
    E --> F["codegraph: Building Codegraph"]
    F --> G["git: Saving to Storage"]
    G --> H["store: Persisting Version"]
    H --> I["ecosystem: Building Ecosystem"]
    I --> J["relationships: Mapping Dependencies"]
    J --> K["gateway: Attaching Gateway"]
    K --> L["policies: Evaluating Policies"]

What each phase produces, and where it lands

| Phase | What happens | Where the output lands |
| --- | --- | --- |
| decrypt | Decrypts the stored GitHub connection token and re-resolves the repo's live default branch | (in-memory; nothing persisted) |
| tree | Fetches the full repository file tree from the provider | (in-memory) |
| filter | Applies smart-scan rules to pick indexable source files | (in-memory) |
| fetch | Downloads the selected file contents and config files, with per-file progress | (in-memory) |
| parse | Parses sources into chunks and generates the BlueprintOutput (architecture, patterns, dependencies, API conventions) | the blueprint content blob |
| codegraph | Builds the codegraph (symbols, calls, routes, idioms) and writes code + blueprint embeddings | Kuzu graph + Postgres snapshot; per-org vector store |
| git | Writes the generated blueprint files to git-backed storage | the org's blueprint git repo |
| store | Persists a new immutable blueprint_versions row pinned to the commit SHA | blueprint_versions (Postgres) |
| ecosystem | Upserts the repo into org_repositories with its inferred role | ecosystem tables |
| relationships | Detects and records cross-repo relationships | repo_relationships |
| gateway | Attaches the blueprint to the org's default gateway | gateway config |
| policies | Evaluates org-wide policies against the analysis results | policy results / audit |

By the end of a run you have four durable artifacts:

  1. a codegraph: stored in Kuzu, projected to a Postgres snapshot for fast reads (see Codegraph → Where it lives);
  2. mined idioms: the canonical way your codebase does each thing, resolved from the graph;
  3. a versioned blueprint: content-addressed and pinned to the git commit it was analyzed from (see Blueprints → Versioning);
  4. embeddings: one coarse blueprint vector plus a fine-grained vector per codegraph function, in the per-org vector store, for semantic retrieval.

Tip

The parse phase covers everything from tree-sitter parsing through analysis, and codegraph covers the graph build and the embeddings write, which on large or PHP-heavy repos can run for minutes. Splitting codegraph out from parse is deliberate: it's why the dashboard shows honest progress instead of sitting on "Analyzing Code".

Re-analysis: what's incremental, what's rebuilt

Analysis is repeatable and idempotent: you re-run it whenever code changes, and re-analyzing the same commit is cheap.

The codegraph and the blueprint are rebuilt from scratch each run. A re-analysis picks up new commits, so the graph and blueprint always reflect the SHA you analyzed, never a stale working tree. Each run produces a new, immutable blueprint version; the full history is retained, and the blueprint pointer moves to the latest version.
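The versioning behaviour (append-only history plus a moving pointer) can be illustrated with a small sketch. Everything here is hypothetical: `BlueprintVersion`, `BlueprintHistory`, and their fields are names invented for the example and say nothing about the real `blueprint_versions` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a recorded version cannot be mutated
class BlueprintVersion:
    number: int
    commit_sha: str
    content: str


class BlueprintHistory:
    """Hypothetical append-only history with a 'latest' pointer."""

    def __init__(self):
        self._versions = []
        self.latest = None

    def record(self, commit_sha: str, content: str) -> BlueprintVersion:
        # Every run appends a new version; older versions are never rewritten.
        version = BlueprintVersion(len(self._versions) + 1, commit_sha, content)
        self._versions.append(version)
        self.latest = version  # the pointer moves to the newest version
        return version

    def history(self) -> tuple:
        return tuple(self._versions)
```

Under this model, analyzing two commits in a row leaves two entries in `history()` while `latest` refers only to the second.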

Embeddings dedup by content_hash. Each embedded item (the blueprint summary, each function) is keyed by a SHA-256 of its embedded text. Before calling the embedding provider, the pipeline loads the stored hashes and keeps only items whose hash is new or changed. Unchanged functions are skipped entirely (no embedding call, no write), so re-analyzing a repo where most code is untouched costs only a handful of round-trips.
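The hash-based filter can be sketched in a few lines. This is an illustration under assumptions: the function names, the item ids, and the `{item_id: hash}` shape of the stored hashes are hypothetical. Only the use of SHA-256 over the embedded text comes from the description above.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the text that would be embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_items_to_embed(items: dict, stored_hashes: dict) -> dict:
    """Keep only items whose hash is new or differs from the stored one.

    items:         {item_id: text to embed}   (hypothetical shape)
    stored_hashes: {item_id: content_hash}    (hypothetical shape)
    """
    return {
        item_id: text
        for item_id, text in items.items()
        if stored_hashes.get(item_id) != content_hash(text)
    }


# Usage: one function is unchanged, one is new.
stored = {"fn:a": content_hash("def a(): pass")}
items = {"fn:a": "def a(): pass", "fn:b": "def b(): pass"}
to_embed = select_items_to_embed(items, stored)  # only "fn:b" remains
```

Only the items returned by the filter would be sent to the embedding provider. Everything else is skipped before any network call is made.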

Note

The embeddings write is opt-in and non-fatal. With no OPENAI_API_KEY configured it no-ops, and any failure is logged but never fails the analysis: the codegraph, blueprint, and version still persist.
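An opt-in, non-fatal step like this is commonly written as a guarded wrapper. The sketch below assumes a hypothetical `embed_and_store` callable that returns the number of items written. Only the `OPENAI_API_KEY` variable name is taken from the note above.

```python
import logging
import os

logger = logging.getLogger("analysis")


def write_embeddings_safely(embed_and_store, items, env=os.environ) -> int:
    """Run the embeddings write without ever failing the caller.

    Returns the number of items written, or 0 if the step was
    skipped (no key configured) or raised an error.
    """
    if not env.get("OPENAI_API_KEY"):
        return 0  # opt-in: no key configured, so this is a no-op
    try:
        return embed_and_store(items)
    except Exception:
        # Non-fatal: log the failure and let the analysis continue.
        logger.exception("embeddings write failed; analysis continues")
        return 0
```

Passing `env` as a parameter is a choice made for this example so the behaviour can be exercised without touching the real process environment.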

Warning

A blueprint version is a snapshot at the analyzed commit. Push new code and it's stale until the next analysis runs. Re-run Analyze after meaningful changes to keep injected context current.

How analysis output feeds the gateway

Analysis and serving are decoupled by design. The pipeline writes; the gateway reads. On each request the gateway:

  • loads the attached blueprint and injects it as cached system context, so the model sees your architecture and conventions before the prompt;
  • runs semantic retrieval over the per-org embeddings to surface the most relevant existing functions, so generated code matches what's already there;
  • resolves graph context (callers, callees, relevant idioms) and turns policy enforcement into a graph lookup.

Because the request path only reads pre-computed artifacts, injection stays fast: the heavy lifting already happened in the background job.

Warning

Analysis alone changes nothing about requests until the blueprint is attached to a gateway. The pipeline attaches to the org's default gateway automatically, but a blueprint with no gateway attachment is just stored intelligence.

Watching progress in the dashboard

While a run is in flight, the Repositories view streams the live phase. Each phase shows as upcoming, active, or completed, with the item count and elapsed time for finished phases, so a long codegraph phase reads as honest work rather than a hang.
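Because the phase sequence is fixed, the three display states can be derived from the key of the currently active phase. The sketch below uses the phase keys and labels listed on this page. The `phase_states` function itself is a hypothetical illustration, not the dashboard's code.

```python
# Phase keys and labels, in the order the handler emits them.
PHASES = [
    ("decrypt", "Verifying Access"),
    ("tree", "Scanning Repository"),
    ("filter", "Filtering Files"),
    ("fetch", "Downloading Content"),
    ("parse", "Analyzing Code"),
    ("codegraph", "Building Codegraph"),
    ("git", "Saving to Storage"),
    ("store", "Persisting Version"),
    ("ecosystem", "Building Ecosystem"),
    ("relationships", "Mapping Dependencies"),
    ("gateway", "Attaching Gateway"),
    ("policies", "Evaluating Policies"),
]


def phase_states(active_key: str) -> list:
    """Return (label, state) pairs given the currently active phase key."""
    keys = [key for key, _ in PHASES]
    active_index = keys.index(active_key)  # raises ValueError if unknown
    states = []
    for index, (_, label) in enumerate(PHASES):
        if index < active_index:
            state = "completed"
        elif index == active_index:
            state = "active"
        else:
            state = "upcoming"
        states.append((label, state))
    return states
```

For example, `phase_states("codegraph")` marks the five phases before the codegraph build as completed, "Building Codegraph" as active, and the remaining six as upcoming.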

If a job ever stalls, the system self-heals: a sweeper recovers jobs stuck in the running state back to pending for retry, and a terminal sweeper marks ancient stuck jobs as failed rather than leaving them spinning forever. In practice you click Analyze, watch the phases tick through to Evaluating Policies, and end with a fresh, commit-pinned blueprint attached and ready.
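The two recovery rules can be sketched as a single pass over running jobs. The thresholds, the `TrackedJob` type, and the choice to combine both sweepers in one function are assumptions made for the example. The page does not state the real timeouts or how the sweepers are scheduled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
STUCK_AFTER = timedelta(minutes=15)
TERMINAL_AFTER = timedelta(hours=24)


@dataclass
class TrackedJob:
    id: int
    status: str  # pending | running | failed
    started_at: datetime


def sweep(jobs, now: datetime) -> None:
    """Recover stuck running jobs; fail ones that have been stuck too long."""
    for job in jobs:
        if job.status != "running":
            continue
        age = now - job.started_at
        if age >= TERMINAL_AFTER:
            job.status = "failed"   # terminal sweeper: stop retrying
        elif age >= STUCK_AFTER:
            job.status = "pending"  # recovery sweeper: back to the queue
```

Checking the terminal threshold first ensures an ancient job is marked failed rather than being recycled to pending indefinitely.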
