Context Engineering for Internal Assistants

One of the easiest ways to make an internal assistant look smart is to give it more context.

One of the easiest ways to make it unreliable is to do that carelessly.

OpenAI’s Agents SDK frames the modern agent stack clearly: the model can use additional context and tools, hand off to specialized agents, stream partial results, and keep a full trace of what happened. That only works if the context is assembled deliberately. OpenAI Agents SDK

The real job is not “add more data.” It is to decide which facts belong in the model’s view, in what order, and with what provenance.

The failure modes

When context is bad, the symptoms are easy to recognize:

Failure mode	What it looks like	What actually went wrong
Drowned evidence	The model misses the important fact even though it was present	The relevant chunk was buried in a long context window, a known positional bias problem. The “Found in the Middle” paper shows U-shaped attention, with the beginning and end of the prompt receiving more attention than the middle. Found in the Middle
Contaminated evidence	A doc or ticket behaves like an instruction instead of data	The retrieved content was not treated as untrusted input
Scope drift	The answer includes data from the wrong tenant or workspace	The retrieval layer did not enforce access rules before assembly
Trace drift	You can see the answer, but not the exact evidence set	The context bundle was not versioned or logged

That is why “context engineering” is a better phrase than “more retrieval.” Retrieval is only one input to the stack.

Start with the task

Different tasks want different context.

summarize the last incident
answer a policy question
draft a customer reply
explain the current deploy status

If you skip task framing, the retriever becomes a generic relevance engine pointed at everything. That usually produces noise, not confidence.

Assemble context in layers

I trust context packages more when they are built from named blocks:

task instructions
identity and scope
recent conversation state
retrieved evidence
tool-derived facts
output constraints

This is the same separation-of-concerns pattern OpenAI describes in the Assistants migration guide, where application code handles orchestration and the prompt stays focused on high-level behavior and constraints. OpenAI Assistants migration

If a block cannot justify itself, it should not be in the prompt.

Retrieval is not enough

OpenAI’s Retrieval API is semantic search over your data, and the docs note that hybrid search can balance semantic similarity with keyword overlap. That is useful for internal assistants because exact terms, IDs, and policy phrases often matter as much as semantic similarity. OpenAI Retrieval

But the retrieval result is still raw material. You still need to:

deduplicate near-identical chunks
rank evidence by task relevance
keep source metadata attached
filter by ACL before assembly
compress only after you know the evidence is the right evidence

If you summarize too early, you create a new artifact that is harder to audit than the original documents.

Keep provenance attached

Every evidence block should answer four questions:

where did it come from
who can see it
how fresh is it
why was it selected

If you cannot answer those questions, the assistant is not grounded, it is guessing with a citation-shaped memory.

That matters for debugging too. When a user says the assistant got it wrong, the useful question is not “what did the model think?” It is “what exact evidence did the model see?”

Avoid contamination

Retrieved content is not trusted just because it came from your own system.

OWASP’s current LLM risk list includes prompt injection, sensitive information disclosure, improper output handling, excessive agency, and system prompt leakage. Those are all context problems as much as model problems. OWASP LLM Top 10 2025

That is why context assembly has to distinguish between:

instruction text
evidence text
tool output
user input

Those lanes should not all be treated as equivalent prose.

Measure the bundle, not just the answer

If you want to improve context quality, instrument the bundle itself:

retrieval hit quality
token count by block
evidence age
answer success with and without the block
whether the cited source actually appeared in the final context

The answer metric alone is too late. By the time it fails, you no longer know if the fault was retrieval, ordering, compression, or the prompt.

The rule I use

An internal assistant is only as good as the context it can explain.

If the context is scoped, versioned, ordered, and reproducible, the model has a real chance to be useful. If the context is just a pile of “relevant” text, the assistant will still sound polished, but it will be polished in the wrong direction.