Theme
Context Engineering for Internal Assistants

Context Engineering for Internal Assistants

5 min read

One of the easiest ways to make an internal assistant look smart is to give it more context.

One of the easiest ways to make it unreliable is to do that carelessly.

OpenAI’s Agents SDK frames the modern agent stack clearly: the model can use additional context and tools, hand off to specialized agents, stream partial results, and keep a full trace of what happened. That only works if the context is assembled deliberately. OpenAI Agents SDK

The real job is not “add more data.” It is to decide which facts belong in the model’s view, in what order, and with what provenance.

The failure modes

When context is bad, the symptoms are easy to recognize:

Failure modeWhat it looks likeWhat actually went wrong
Drowned evidenceThe model misses the important fact even though it was presentThe relevant chunk was buried in a long context window, a known positional bias problem. The “Found in the Middle” paper shows U-shaped attention, with the beginning and end of the prompt receiving more attention than the middle. Found in the Middle
Contaminated evidenceA doc or ticket behaves like an instruction instead of dataThe retrieved content was not treated as untrusted input
Scope driftThe answer includes data from the wrong tenant or workspaceThe retrieval layer did not enforce access rules before assembly
Trace driftYou can see the answer, but not the exact evidence setThe context bundle was not versioned or logged

That is why “context engineering” is a better phrase than “more retrieval.” Retrieval is only one input to the stack.

Start with the task

Different tasks want different context.

  • summarize the last incident
  • answer a policy question
  • draft a customer reply
  • explain the current deploy status

If you skip task framing, the retriever becomes a generic relevance engine pointed at everything. That usually produces noise, not confidence.

Assemble context in layers

I trust context packages more when they are built from named blocks:

  1. task instructions
  2. identity and scope
  3. recent conversation state
  4. retrieved evidence
  5. tool-derived facts
  6. output constraints

This is the same separation-of-concerns pattern OpenAI describes in the Assistants migration guide, where application code handles orchestration and the prompt stays focused on high-level behavior and constraints. OpenAI Assistants migration

If a block cannot justify itself, it should not be in the prompt.

Retrieval is not enough

OpenAI’s Retrieval API is semantic search over your data, and the docs note that hybrid search can balance semantic similarity with keyword overlap. That is useful for internal assistants because exact terms, IDs, and policy phrases often matter as much as semantic similarity. OpenAI Retrieval

But the retrieval result is still raw material. You still need to:

  • deduplicate near-identical chunks
  • rank evidence by task relevance
  • keep source metadata attached
  • filter by ACL before assembly
  • compress only after you know the evidence is the right evidence

If you summarize too early, you create a new artifact that is harder to audit than the original documents.

Keep provenance attached

Every evidence block should answer four questions:

  • where did it come from
  • who can see it
  • how fresh is it
  • why was it selected

If you cannot answer those questions, the assistant is not grounded, it is guessing with a citation-shaped memory.

That matters for debugging too. When a user says the assistant got it wrong, the useful question is not “what did the model think?” It is “what exact evidence did the model see?”

Avoid contamination

Retrieved content is not trusted just because it came from your own system.

OWASP’s current LLM risk list includes prompt injection, sensitive information disclosure, improper output handling, excessive agency, and system prompt leakage. Those are all context problems as much as model problems. OWASP LLM Top 10 2025

That is why context assembly has to distinguish between:

  • instruction text
  • evidence text
  • tool output
  • user input

Those lanes should not all be treated as equivalent prose.

Measure the bundle, not just the answer

If you want to improve context quality, instrument the bundle itself:

  • retrieval hit quality
  • token count by block
  • evidence age
  • answer success with and without the block
  • whether the cited source actually appeared in the final context

The answer metric alone is too late. By the time it fails, you no longer know if the fault was retrieval, ordering, compression, or the prompt.

The rule I use

An internal assistant is only as good as the context it can explain.

If the context is scoped, versioned, ordered, and reproducible, the model has a real chance to be useful. If the context is just a pile of “relevant” text, the assistant will still sound polished, but it will be polished in the wrong direction.

Sources