Context injection
Context injection is what makes generated code look like your code. The gateway sits in front of the model and, on every request, prepends context drawn from your own repository β its architecture and conventions, plus the few existing functions most relevant to what you're asking. The model answers with that context in front of it, so suggestions follow patterns you already have instead of inventing new ones.
It runs without changing your tool or your prompt. You send a normal request; the gateway enriches it on the way upstream.
Two layers of context
Each request gets up to two complementary blocks of context. They answer different questions and are assembled differently.
| Layer | Question it answers | Scope | Freshness |
|---|---|---|---|
| Blueprint context | "What kind of repo is this?" | Whole-repo profile | Stable; cached |
| Retrieved code (RAG) | "What existing code is relevant here?" | Per-query | Re-retrieved every request |
Blueprint context is the repo's stable profile: its blueprint β role, languages, frameworks, architecture summary, and the idioms it follows. It barely changes between requests, so it's prepended once as a cached system prefix and reused across the cache window.
Retrieved code context is dynamic. For each request, the gateway runs a semantic search over your org's code embeddings to pull the handful of existing functions closest in meaning to the user's message β the concrete examples the model should match.
Note
The two layers reinforce each other: the blueprint tells the model how your repo is built, while retrieval shows it the specific code to imitate for this task.
The retrieval (RAG) read path
When retrieval is enabled, the gateway embeds the user's latest message, searches your org's vectors, selects the best matches, and prepends them as a system block.
sequenceDiagram
participant CC as Claude Code / SDK
participant GW as Unyform Gateway
participant EMB as Embeddings (OpenAI)
participant PG as pgvector (per-org)
CC->>GW: request (last user message)
GW->>EMB: embed query (text-embedding-3-small)
EMB-->>GW: 1536-dim vector
GW->>PG: cosine search, scoped to org_id
PG-->>GW: candidate matches (score-ordered)
GW->>GW: filter min_score, keep top chunks
GW->>GW: prepend [code]/[blueprint] system block
GW->>CC: (request continues upstream with context)
Step by step:
- Pick the query. The most recent user message in the request is the search query. If there's no user text, retrieval is skipped.
- Embed it. The query is embedded with OpenAI
text-embedding-3-smallβ the same model the write path used, so query and stored vectors are directly comparable (1536 dimensions). - Search, scoped to your org. A cosine-similarity search runs against
pgvector, always filtered by
org_id. There is no shared tenant; one org never sees another's vectors. - Select. Candidates below the score threshold are dropped, and the top few are kept. A slot is reserved for the blueprint summary so a flood of fine-grained code matches can't crowd out the codebase overview entirely.
- Format and prepend. Each match is tagged by kind β
[code],[blueprint], or[doc]β and prepended as asystemmessage introduced with: "The following is relevant context retrieved from this organization's codebase via semantic searchβ¦"
If nothing clears the threshold, no block is added and the request proceeds exactly as sent.
The write path: where embeddings come from
Retrieval can only find code that's been embedded. That happens during analysis, the same pipeline that builds your codegraph and blueprint.
For every function in the call graph, analysis builds a short natural-language
description β the signature (the contract), the docstring (intent), and
always the name, language, and path β then embeds it. Including the signature
keeps two same-named functions (two news, say) from collapsing into one vector.
These fine-grained per-symbol vectors are the richest signal for code retrieval.
Analysis also distills one compact, prose summary of the blueprint β role, languages, frameworks, patterns, architecture β and embeds that as a single coarse vector. It's the high-level "what is this repo" anchor that the reserved slot protects during selection.
Both write paths deduplicate by content hash: re-analyzing an unchanged repo re-embeds nothing. The store is asked which sources are unchanged first, so only new or modified functions cost an embedding call.
Tip
Function identity strips the line number, so a function that moves but doesn't change updates its existing vector instead of leaving a stale duplicate behind. Embeddings stay in sync with your code without growing unbounded.
Guarantees
Context injection is designed to help silently and never get in the way.
- Best-effort, never blocking. Every step β embedding, search, blueprint injection β is non-fatal. A missing embedding key, a slow search, or an empty result skips that layer and the request proceeds normally. Context injection cannot fail your request.
- BYOK preserved. Injection happens around your request. Your provider key is still forwarded verbatim β context is added to the payload, not to who you are to the upstream API.
- Observable. Responses carry
x-unyform-blueprintsandx-unyform-blueprint-tokensheaders showing which blueprints were applied and how much context they added β so you can confirm injection happened.
Warning
Retrieval is opt-in and requires an embeddings key to be configured for the gateway. Without it, blueprint context still applies, but per-query code retrieval is silently skipped β you'll see blueprint headers but no retrieved snippets.
Where it runs
Both layers apply across every gateway protocol β OpenAI-style chat completions,
the hosted passthrough, and the native Anthropic /v1/messages path. Blueprint
context is prepended as a cached system prefix (Anthropic
prompt caching,
cache_control: ephemeral); retrieved code is prepended as a system block.
Injection even runs on Claude Code's count_tokens preflight, so its token
estimate reflects the context the real request will carry.
Tuning facts
These defaults govern retrieval and are stated for reference:
- Minimum score is
0.3. Cosine similarity over short text embeddings is modest, so this threshold keeps clearly-relevant hits without admitting noise. - Maximum chunks is
5per request, with1slot reserved for the blueprint summary. - Embedding model is OpenAI
text-embedding-3-small(1536 dimensions) on both the read and write paths.
See the gateway for the full request path, and blueprints for how the stable profile is built.
Edit this page on GitHub