When to Fine-Tune vs Retrieve vs Prompt
The bad decision pattern is consistent: a team has a model that is sort of right, sort of unreliable, and then reaches for the most expensive lever first.
That usually means one of three moves:
- add more prompt text
- add retrieval
- fine-tune the model
These are not interchangeable. OpenAI’s docs separate them pretty cleanly: prompt engineering is about giving the model clear instructions, goals, and context; retrieval is about semantic search over your data; and fine-tuning is about training a model to excel at a task with examples. Prompt engineering, Retrieval, Model optimization
The actual decision
I reduce the choice to one question: what is missing?
| Missing piece | The failure looks like | Best first move |
|---|---|---|
| Better instructions | The model knows the fact, but the answer is sloppy, verbose, or misformatted | Prompting |
| Better evidence | The answer depends on current, private, or tenant-scoped information | Retrieval |
| Better behavior | The model keeps making the same output or reasoning mistake across many examples | Fine-tuning |
| Better authorization | The model can see data it should not see | App-layer access control, not prompting |
If you do not diagnose the failure mode first, you will spend money on the wrong fix and then call the result “product iteration.”
Prompting is for steering, not storing
Prompting works when the model already has the capability and only needs sharper direction.
That includes:
- output shape and formatting
- tone and refusal style
- tool selection hints
- asking for clarification when inputs are incomplete
- small workflow changes that do not require durable knowledge
OpenAI’s prompt engineering guide is explicit that the prompt should provide clear goals and, where needed, extra context and examples. OpenAI Prompting
Prompting starts to fail when the prompt becomes a warehouse:
- product docs pasted inline
- policy text duplicated across routes
- long examples that mutate every deploy
- hidden behavior rules that only live in prose
At that point, you are not engineering a prompt. You are growing a fragile config blob.
Retrieval is for evidence
Retrieval is the right move when the model needs facts that live outside the base model or change faster than the model can.
OpenAI’s Retrieval API performs semantic search over your data and supports hybrid search that balances semantic and keyword matching. That is a better fit for corpora where exact terms, IDs, policy clauses, or ticket numbers matter. OpenAI Retrieval
I reach for retrieval when the task depends on:
- internal documentation
- current product or policy state
- customer-specific records
- regulated material with explicit access rules
Retrieval is also the right choice when you need to answer “where did this come from?” with something stronger than vibes. If the answer can be traced to documents, the system is easier to audit and debug.
Retrieval does not remove context design work. In fact, long-context models make the failure mode more visible. The “found in the middle” paper shows a U-shaped attention bias where tokens at the start and end of the input get more attention, regardless of relevance, and it reports RAG improvements of up to 15 percentage points after calibrating that bias. Found in the Middle
That means the job is not just “retrieve more.” It is “retrieve the right things, rank them correctly, and place them where the model is likely to use them.”
Fine-tuning is for repeated behavior
Fine-tuning pays off when the thing you are trying to fix is stable and repeated enough that examples beat instructions.
OpenAI’s fine-tuning guide is clear about the payoff: you can make a model consistently format responses, handle novel inputs in a task-specific way, use shorter prompts, and train on proprietary data without repeating it in every request. OpenAI Model optimization
Good fine-tuning candidates are usually narrow:
- classification
- extraction into a fixed schema
- domain-specific rewriting
- strict house style
What fine-tuning does not solve:
- freshness
- per-user or per-tenant permissions
- changing policy text
- knowledge that should be visible only in a scoped session
If the problem is “the model keeps being wrong because it does not see the right facts,” fine-tuning is the wrong tool. If the problem is “the model sees the facts but keeps behaving inconsistently,” fine-tuning starts to make sense.
The sequence I actually trust
The practical order is usually:
- tighten the prompt and output contract
- add retrieval if the answer depends on external evidence
- fine-tune only when the behavior is stable, measurable, and repeated
OpenAI’s Assistants migration guide makes the same separation of concerns explicit: application code handles orchestration, like history pruning, tool loops, and retries, while the prompt stays focused on high-level behavior and constraints. OpenAI Assistants migration
That split is the real reason the order matters. It keeps you from encoding infrastructure concerns into prose.
What I would not do
I would not fine-tune:
- stale facts
- tenant-scoped policy
- access control logic
- anything that changes weekly
I would not use retrieval as a substitute for clear instructions, because retrieved evidence does not tell the model what to do with it.
And I would not keep enlarging the prompt until it becomes a junk drawer. That usually means the app still does not know what it is actually responsible for.
The rule of thumb
If one good example fixes it, start with prompting.
If the answer lives in documents, records, or fresh data, use retrieval.
If the same output defect keeps showing up across many examples and routes, fine-tune.
If the issue is who can see the data, solve that in the application and retrieval layers before the model ever sees it.
That is the small set of decisions that keeps the system simpler rather than just more expensive.