# Prompt Caching for LLM Apps: Where It Actually Pays Off
Prompt caching is easy to explain and annoyingly easy to get wrong.
The vendor docs make the core shape explicit. Anthropic’s prompt caching caches a prompt prefix that includes tools, system, and messages, in that order, with a 5-minute default TTL and an optional 1-hour TTL for slower reuse windows.[1] OpenAI’s pricing page now exposes cached input as a separate price column on supported models, which is the right signal: this is a first-class cost lever, not a hidden implementation detail.[2]
That does not mean you should cache everything. It means you should identify the reusable prefix, freeze it with a versioned contract, and stop paying full price for the same context on every run.
## The part that is actually reusable
In production, the expensive prefix usually comes from a few places:
- system instructions that stay stable across many turns
- tool definitions and response schemas
- product policy or compliance text
- shared documentation or retrieval bundles
- few-shot examples that guide style or extraction
What does not belong in the cache is anything whose correctness depends on the current user, current permissions, current tenant, or current state of the world. If the answer changes when the user changes, the cache key needs that user boundary in it or the content should stay out of cache entirely.
That distinction matters because many teams discover caching only after they have built a giant prompt blob. The cache then hides the problem instead of fixing it. If the shared prefix keeps growing, caching is not your architecture. It is your painkiller.
## The failure modes are predictable
I keep seeing four repeat offenders:
| Failure mode | What it looks like in production | Why it hurts |
|---|---|---|
| Hashing the rendered prompt string | A whitespace edit or reordered section causes a miss | You turn formatting noise into cache churn |
| Caching user-specific context | A tenant or permission change serves stale context | You create a data exposure bug, not a latency win |
| Caching retrieval output without provenance | The model sees an old doc snapshot and still sounds confident | Freshness and auditability disappear together |
| Treating cache hit rate as the only metric | Finance loves the savings, but users do not feel the difference | You optimize the wrong layer |
The last one is common because hit rate is easy to count. It is also incomplete. A cache hit that returns the wrong scope is worse than a miss.
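One way to enforce that rule is to treat a hit with the wrong scope as a miss at read time. This is a minimal sketch with an assumed entry shape (a dict carrying `tenant_scope` and `prefix`); the helper name is illustrative, not a library API:

```python
def cached_prefix(cache: dict, key: str, scope: str):
    """Return the cached prefix only if its tenant scope matches.

    A hit with the wrong scope is evicted and reported as a miss,
    because serving it would be a data exposure bug, not a latency win.
    """
    entry = cache.get(key)
    if entry is None:
        return None
    if entry["tenant_scope"] != scope:
        # Wrong-scope hit: evict and miss rather than serve stale context.
        del cache[key]
        return None
    return entry["prefix"]
```

The eviction is deliberate: a wrong-scope entry is evidence the key is missing a boundary, so keeping it around only invites a repeat.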
## Version the things you control
The cache key should describe meaning, not formatting.
I want keys built from stable inputs such as:
```
model=gpt-5.1
policy_version=12
tool_schema_version=4
retrieval_bundle_version=2026-03-28
tenant_scope=acme-prod
locale=en-US
```
That shape gives you two useful properties:
- You can reason about invalidation without diffing the whole prompt.
- You can explain a miss or a stale hit in terms of a specific version change.
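A minimal sketch of that key shape. The field names mirror the list above; the canonical-JSON-plus-hash construction is one reasonable choice, not a vendor requirement:

```python
import hashlib
import json

def cache_key(**parts: str) -> str:
    """Build a deterministic cache key from versioned, meaning-level inputs.

    Sorting the parts makes the key independent of argument order, and
    hashing canonical JSON keeps formatting noise out of the key entirely.
    """
    canonical = json.dumps(dict(sorted(parts.items())), separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

key = cache_key(
    model="gpt-5.1",
    policy_version="12",
    tool_schema_version="4",
    retrieval_bundle_version="2026-03-28",
    tenant_scope="acme-prod",
    locale="en-US",
)
```

Bumping any single version produces a different key, which is exactly the invalidation story you want to be able to explain.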
Anthropic’s docs also note that cache hits do not count against rate limits, which makes observability more important, not less. If you do not log hit rate by route and tenant, you can accidentally push a hot path into a slower path while the global cache number still looks good.[1]
## Cache the compiled prefix, not the raw string
The best abstraction is usually a compiled context object:
- normalized system instructions
- tool schema bundle
- shared examples
- curated evidence or retrieval blocks
- metadata for source versions and permission scope
This is better than caching one giant string because it keeps the boundary visible. You can still render the model-specific prompt at the edge, but the reusable artifact stays structured and inspectable.
That also makes invalidation cleaner. If the tool schema changes, you invalidate the schema component. If policy changes, you invalidate the policy component. If a doc changes, you invalidate the evidence bundle. You are no longer guessing which whitespace change matters.
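A sketch of such a compiled object, with illustrative component and version names; the point is that each component carries its own version, so invalidation is per-part rather than per-prompt-string:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompiledContext:
    # Each component is (version, content), so staleness is decided
    # per component instead of by diffing one giant rendered string.
    system: tuple[str, str]        # (policy_version, text)
    tool_schemas: tuple[str, str]  # (tool_schema_version, json)
    evidence: tuple[str, str]      # (retrieval_bundle_version, text)

    def stale_components(self, current: dict[str, str]) -> list[str]:
        """Name exactly the components whose version no longer matches."""
        stale = []
        if self.system[0] != current["policy_version"]:
            stale.append("system")
        if self.tool_schemas[0] != current["tool_schema_version"]:
            stale.append("tool_schemas")
        if self.evidence[0] != current["retrieval_bundle_version"]:
            stale.append("evidence")
        return stale
```

Rendering the model-specific prompt from this object happens at the edge; the cached artifact stays structured and inspectable.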
## Measure the right things
The metrics I care about are the ones that connect cost to actual product behavior:
| Metric | Why it matters |
|---|---|
| Cache hit rate by route | Tells you whether the cache matches real traffic |
| Input tokens saved | Shows direct cost reduction |
| p50 and p95 latency saved | Proves users can feel it |
| Miss reason | Separates version churn from architecture churn |
| Error rate on hit vs miss | Catches correctness regressions |
| Cost per successful outcome | Keeps you from optimizing the wrong system |
If your dashboard does not split hit rate by tenant or request type, it can lie to you. A single global number can hide the fact that your highest-value workflow barely benefits.
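The split-by-route-and-tenant accounting is simple enough to sketch directly; `CacheMetrics` here is a hypothetical helper, not a library API:

```python
from collections import defaultdict

class CacheMetrics:
    """Track hit rate per (route, tenant) so a single global number
    cannot hide a high-value workflow that never hits."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, route: str, tenant: str, hit: bool) -> None:
        key = (route, tenant)
        self.total[key] += 1
        if hit:
            self.hits[key] += 1

    def hit_rate(self, route: str, tenant: str) -> float:
        key = (route, tenant)
        return self.hits[key] / self.total[key] if self.total[key] else 0.0

    def global_hit_rate(self) -> float:
        total = sum(self.total.values())
        return sum(self.hits.values()) / total if total else 0.0
```

The interesting dashboards compare `hit_rate` per route against `global_hit_rate`: when they diverge, the global number is lying to someone.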
## The rollout that does not burn you
The safe rollout is still the boring one:
1. Log the candidate cache key and the reusable prefix.
2. Run shadow measurements and estimate hit rate.
3. Enable read-through caching for one narrow route.
4. Compare latency, cost, and outcome quality against control traffic.
5. Expand only after invalidation behaves the way you expected.
If the dry run looks weak, stop. Caching is not a substitute for prompt architecture.
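The shadow-measurement step can be approximated offline from request logs before any cache exists. This sketch assumes `(timestamp_seconds, request)` pairs and the 5-minute default reuse window; `shadow_measure` is an illustrative helper:

```python
def shadow_measure(requests, build_key, ttl: float = 300.0) -> float:
    """Estimate hit rate without serving anything from a cache.

    Replays logged requests, computes the key each one would have used,
    and counts a hit whenever the same key reappears within the TTL
    window (300 s matches Anthropic's 5-minute default).
    """
    last_seen: dict[str, float] = {}
    hits = 0
    for ts, req in requests:
        key = build_key(req)
        if key in last_seen and ts - last_seen[key] <= ttl:
            hits += 1
        last_seen[key] = ts
    return hits / len(requests) if requests else 0.0
```

If this dry-run number is low, the reusable prefix is not actually being reused, and enabling the real cache would only add complexity.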
## When not to cache
Do not add prompt caching if:
- the prompt changes materially on every request
- the reusable prefix is tiny
- your permissions model is still fluid
- the source of truth is still moving
- you cannot tell which prefix blocks are safe to share
That is the part people skip because caching feels like a free win. In reality, it is an architectural choice with a security boundary attached.
## A useful heuristic
If you cannot freeze the reusable prefix for at least one normal reuse window, do not cache it yet.
That window is 5 minutes in Anthropic’s default cache and 1 hour if you opt into the longer TTL.[1] The exact duration matters less than the operational rule: if your prefix is not stable enough to survive normal reuse, the cache will either miss constantly or go stale in ways that are hard to notice.
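In Anthropic’s API the cache breakpoint is declared with a `cache_control` field on a content block, and the 1-hour window is the opt-in `"ttl": "1h"` variant. The field names below follow Anthropic’s prompt caching docs; the model id and policy text are placeholders:

```python
# Placeholder standing in for a long, stable system prompt. In a real
# request this is the frozen, versioned prefix, not per-user content.
LONG_STABLE_POLICY_TEXT = "You are a support assistant. Policy v12: ..."

# Shape of an Anthropic Messages API request body with an explicit
# cache breakpoint. "ephemeral" defaults to a 5-minute TTL; "ttl": "1h"
# opts into the longer window.
request_body = {
    "model": "claude-sonnet-4-5",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_STABLE_POLICY_TEXT,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

Everything after the breakpoint (here, the per-user message) is billed as ordinary input, which is exactly where user-specific context belongs.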
Prompt caching is worth doing when it removes repeated work without making the system opaque. That is why the feature exists in the first place. The hard part is still the same one every production system has: deciding what is allowed to be reused, proving it is still valid, and knowing when to throw it away.