Context injection

Context injection is what makes generated code look like your code. The gateway sits in front of the model and, on every request, prepends context drawn from your own repository — its architecture and conventions, plus the few existing functions most relevant to what you're asking. The model answers with that context in front of it, so suggestions follow patterns you already have instead of inventing new ones.

It runs without changing your tool or your prompt. You send a normal request; the gateway enriches it on the way upstream.

Two layers of context

Each request gets up to two complementary blocks of context. They answer different questions and are assembled differently.

Layer	Question it answers	Scope	Freshness
Blueprint context	"What kind of repo is this?"	Whole-repo profile	Stable; cached
Retrieved code (RAG)	"What existing code is relevant here?"	Per-query	Re-retrieved every request

Blueprint context is the repo's stable profile, drawn from its blueprint: role, languages, frameworks, architecture summary, and the idioms it follows. It barely changes between requests, so it's sent as a cached system prefix that is reused for as long as the cache window lasts.

Retrieved code context is dynamic. For each request, the gateway runs a semantic search over your org's code embeddings to pull the handful of existing functions closest in meaning to the user's message — the concrete examples the model should match.

Note

The two layers reinforce each other: the blueprint tells the model how your repo is built, while retrieval shows it the specific code to imitate for this task.

The retrieval (RAG) read path

When retrieval is enabled, the gateway embeds the user's latest message, searches your org's vectors, selects the best matches, and prepends them as a system block.

sequenceDiagram
    participant CC as Claude Code / SDK
    participant GW as Unyform Gateway
    participant EMB as Embeddings (OpenAI)
    participant PG as pgvector (per-org)
    CC->>GW: request (last user message)
    GW->>EMB: embed query (text-embedding-3-small)
    EMB-->>GW: 1536-dim vector
    GW->>PG: cosine search, scoped to org_id
    PG-->>GW: candidate matches (score-ordered)
    GW->>GW: filter min_score, keep top chunks
    GW->>GW: prepend [code]/[blueprint] system block
    GW->>CC: (request continues upstream with context)

Step by step:

Pick the query. The most recent user message in the request is the search query. If there's no user text, retrieval is skipped.
Embed it. The query is embedded with OpenAI text-embedding-3-small (1536 dimensions), the same model the write path uses, so query vectors and stored vectors are directly comparable.
Search, scoped to your org. A cosine-similarity search runs against pgvector, always filtered by org_id. There is no shared tenant; one org never sees another's vectors.
Select. Candidates below the minimum score (0.3 by default) are dropped, and the highest-scoring matches are kept, up to 5 per request. One slot is reserved for the blueprint summary so a flood of fine-grained code matches can't crowd out the codebase overview.
Format and prepend. Each match is tagged by kind — [code], [blueprint], or [doc] — and prepended as a system message introduced with: "The following is relevant context retrieved from this organization's codebase via semantic search…"

If nothing clears the threshold, no block is added and the request proceeds exactly as sent.
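The selection and formatting steps above can be sketched roughly as follows. This is a minimal illustration, not the gateway's code: the `Match` type, the function names, the assumption that message content is a plain string, and the way the reserved slot replaces the weakest pick are all hypothetical. Only the thresholds, the tags, and the opening of the preamble come from this page.

```python
from dataclasses import dataclass
from typing import Optional

MIN_SCORE = 0.3   # default minimum score stated on this page
MAX_CHUNKS = 5    # default chunk cap stated on this page
# The page quotes only the start of the preamble; the rest is elided there too.
PREAMBLE = ("The following is relevant context retrieved from this "
            "organization's codebase via semantic search…")


@dataclass
class Match:  # hypothetical shape of one search hit
    kind: str     # "code", "blueprint", or "doc"
    text: str
    score: float  # cosine similarity, higher is closer


def pick_query(messages: list[dict]) -> Optional[str]:
    """Return the most recent user message's text, or None to skip retrieval."""
    for message in reversed(messages):
        if message.get("role") == "user":
            content = message.get("content")
            # Assumption: content is a plain string, not a list of blocks.
            if isinstance(content, str) and content.strip():
                return content
            return None
    return None


def select(candidates: list[Match]) -> list[Match]:
    """Drop low scores, keep the top matches, and protect one blueprint slot."""
    passing = sorted(
        (m for m in candidates if m.score >= MIN_SCORE),
        key=lambda m: m.score,
        reverse=True,
    )
    chosen = passing[:MAX_CHUNKS]
    if not any(m.kind == "blueprint" for m in chosen):
        blueprint = next((m for m in passing if m.kind == "blueprint"), None)
        if blueprint is not None:
            # Assumed policy: the blueprint takes the place of the weakest pick.
            chosen[-1] = blueprint
    return chosen


def format_block(matches: list[Match]) -> Optional[str]:
    """Build the system block, or None when nothing cleared the threshold."""
    if not matches:
        return None
    return "\n\n".join([PREAMBLE] + [f"[{m.kind}] {m.text}" for m in matches])
```

With six strong code matches and one weaker blueprint match, `select` returns four code chunks plus the blueprint, which is the crowding-out case the reserved slot is meant to prevent.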

The write path: where embeddings come from

Retrieval can only find code that's been embedded. That happens during analysis, the same pipeline that builds your codegraph and blueprint.

For every function in the call graph, analysis builds a short natural-language description and embeds it. The description combines the signature (the contract) and the docstring (intent), and always includes the name, language, and path. Including the signature keeps two same-named functions (two functions called new, say) from collapsing into one vector. These fine-grained per-symbol vectors are the richest signal for code retrieval.
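As an illustration of how such a description might be assembled, here is a small sketch. The `FunctionInfo` type, the `describe` function, and the exact wording of the output are hypothetical; the page only states which fields go in, not how they are laid out.

```python
from dataclasses import dataclass


@dataclass
class FunctionInfo:  # hypothetical record for one call-graph function
    name: str
    language: str
    path: str
    signature: str = ""
    docstring: str = ""


def describe(fn: FunctionInfo) -> str:
    """Build the text that would be embedded for one function."""
    # Name, language, and path are always present.
    parts = [f"{fn.language} function {fn.name} in {fn.path}"]
    if fn.signature:
        parts.append(f"signature: {fn.signature}")
    if fn.docstring:
        parts.append(f"intent: {fn.docstring.strip()}")
    return "\n".join(parts)


user_new = FunctionInfo("new", "rust", "src/model.rs", "fn new(name: String) -> User")
session_new = FunctionInfo("new", "rust", "src/model.rs", "fn new() -> Session")
# Same name and path, but the signatures keep the two texts distinct.
print(describe(user_new) != describe(session_new))
```

Because the signature is part of the text, the two `new` functions produce different inputs to the embedding model and therefore different vectors.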

Analysis also distills one compact, prose summary of the blueprint — role, languages, frameworks, patterns, architecture — and embeds that as a single coarse vector. It's the high-level "what is this repo" anchor that the reserved slot protects during selection.

Both write paths deduplicate by content hash: re-analyzing an unchanged repo re-embeds nothing. The store is asked which sources are unchanged first, so only new or modified functions cost an embedding call.
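Content-hash deduplication of this kind can be sketched in a few lines. The function names, the use of SHA-256, and the dictionary standing in for the vector store are assumptions made for the example; the page does not say which hash or store interface is used.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the text that would be embedded (SHA-256 assumed)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_embeddings(descriptions: dict[str, str],
                    stored_hashes: dict[str, str]) -> list[str]:
    """Return the source ids that are new or changed and so need embedding.

    `stored_hashes` stands in for asking the store which sources are unchanged.
    """
    return [
        source_id
        for source_id, text in descriptions.items()
        if stored_hashes.get(source_id) != content_hash(text)
    ]


descriptions = {"src/auth.py::login": "python function login in src/auth.py"}
stored = {k: content_hash(v) for k, v in descriptions.items()}
print(plan_embeddings(descriptions, stored))  # nothing changed, nothing to embed
```

Only the ids returned by `plan_embeddings` would be sent to the embedding model, so an unchanged repo costs no embedding calls.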

Tip

A function's identity excludes its line number, so a function that moves but doesn't change updates its existing vector instead of leaving a stale duplicate behind. Embeddings stay in sync with your code without growing without bound.
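To see why leaving the line number out of the identity matters, consider this sketch. The key format (`path::name`) and the dictionary-backed store are hypothetical, chosen only to show the upsert behaviour described above.

```python
def function_identity(path: str, name: str, line: int) -> str:
    """Hypothetical identity key; the line number is deliberately left out."""
    return f"{path}::{name}"


def upsert(store: dict, path: str, name: str, line: int, vector: list[float]) -> None:
    """Insert or overwrite the entry for this function."""
    store[function_identity(path, name, line)] = {"line": line, "vector": vector}


store: dict = {}
upsert(store, "src/auth.py", "login", 10, [0.1, 0.2])
# The function moved down the file but is otherwise unchanged.
upsert(store, "src/auth.py", "login", 42, [0.1, 0.2])
print(len(store))  # one entry, not two
```

If the line number were part of the key, the second call would create a new entry and the one recorded at line 10 would linger as a stale duplicate.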

Guarantees

Context injection is designed to help silently and never get in the way.

Best-effort, never blocking. Every step — embedding, search, blueprint injection — is non-fatal. A missing embedding key, a slow search, or an empty result skips that layer and the request proceeds normally. Context injection cannot fail your request.
BYOK preserved. Injection changes the payload, not your identity. Your provider key is still forwarded verbatim to the upstream API.
Observable. Responses carry x-unyform-blueprints and x-unyform-blueprint-tokens headers showing which blueprints were applied and how much context they added — so you can confirm injection happened.
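The best-effort guarantee amounts to treating every context layer as optional. A minimal sketch of that pattern, with hypothetical names (`apply_best_effort`, `Layer`) and a plain dict standing in for the request:

```python
import logging
from typing import Callable

logger = logging.getLogger("context_injection")

Request = dict
Layer = Callable[[Request], Request]  # hypothetical: one enrichment step


def apply_best_effort(request: Request, layers: list[Layer]) -> Request:
    """Run each layer; on any failure, skip that layer and keep going."""
    current = request
    for layer in layers:
        try:
            current = layer(current)
        except Exception:
            # A failed layer is logged and skipped; the request is never failed.
            logger.warning("skipping context layer %r", layer, exc_info=True)
    return current


def retrieval(request: Request) -> Request:
    raise RuntimeError("no embeddings key configured")


def blueprint(request: Request) -> Request:
    return {**request, "system": "blueprint context"}


print(apply_best_effort({"messages": []}, [retrieval, blueprint]))
```

Here the retrieval layer fails and is skipped, while the blueprint layer still applies. A slow search would additionally need a timeout around the layer call, which this sketch leaves out.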

Warning

Retrieval is opt-in and requires an embeddings key to be configured for the gateway. Without it, blueprint context still applies, but per-query code retrieval is silently skipped — you'll see blueprint headers but no retrieved snippets.

Where it runs

Both layers apply across every gateway protocol — OpenAI-style chat completions, the hosted passthrough, and the native Anthropic /v1/messages path. Blueprint context is prepended as a cached system prefix (Anthropic prompt caching, cache_control: ephemeral); retrieved code is prepended as a system block.
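On the Anthropic /v1/messages path, a cached system prefix is expressed as a text block carrying `cache_control`. The sketch below shows one way a payload could be rewritten to carry it; the `inject_blueprint` function is hypothetical, while the `system` block shape and `{"type": "ephemeral"}` follow Anthropic's prompt-caching format.

```python
def inject_blueprint(payload: dict, blueprint_text: str) -> dict:
    """Return a copy of a /v1/messages payload with a cached blueprint prefix."""
    prefix = {
        "type": "text",
        "text": blueprint_text,
        "cache_control": {"type": "ephemeral"},  # marks the end of the cached prefix
    }
    system = payload.get("system")
    if system is None:
        existing = []
    elif isinstance(system, str):
        # The API accepts a bare string; normalise it to a block list.
        existing = [{"type": "text", "text": system}]
    else:
        existing = list(system)
    return {**payload, "system": [prefix, *existing]}


request = {"model": "example-model", "system": "Be brief.", "messages": []}
print(inject_blueprint(request, "Role: billing service. Languages: Go.")["system"])
```

The caller's own system prompt is kept and follows the blueprint block, so the stable prefix stays byte-identical across requests and can be served from cache.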

Injection even runs on Claude Code's count_tokens preflight, so its token estimate reflects the context the real request will carry.

Tuning facts

These defaults govern retrieval and are stated for reference:

Minimum score is 0.3. Cosine similarity scores between short texts tend to be modest, so a low threshold still keeps clearly relevant hits while filtering out noise.
Maximum chunks is 5 per request, with 1 slot reserved for the blueprint summary.
Embedding model is OpenAI text-embedding-3-small (1536 dimensions) on both the read and write paths.
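For intuition about what the minimum score measures, here is cosine similarity written out in plain Python. The function names are illustrative. The conversion helper reflects the fact that pgvector's `<=>` operator returns cosine distance, which is 1 minus the similarity; whether the gateway converts scores this way is an assumption.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def distance_to_score(cosine_distance: float) -> float:
    """Convert a cosine distance (as pgvector's <=> returns) to a similarity."""
    return 1.0 - cosine_distance


def clears_threshold(score: float, min_score: float = 0.3) -> bool:
    """True when a match would survive the default minimum score."""
    return score >= min_score


print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # about 0.707
print(clears_threshold(distance_to_score(0.75)))  # similarity 0.25, so False
```

Real embeddings have 1536 dimensions rather than two, but the arithmetic is the same.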

See the gateway for the full request path, and blueprints for how the stable profile is built.
