# Prompt Caching for LLM Apps: Where It Actually Pays Off
Prompt caching is easy to explain and annoyingly easy to get wrong.
The vendor docs make the core shape explicit. Anthropic’s prompt caching caches a prompt prefix that includes tools, system, and messages, in that order, with a 5-minute default TTL and an optional 1-hour TTL for slower reuse windows.[1] OpenAI’s pricing page now exposes cached input as a separate price column on supported models, which is the right signal: this is a first-class cost lever, not a hidden implementation detail.[2]
That does not mean you should cache everything. It means you should identify the reusable prefix, freeze it with a versioned contract, and stop paying full price for the same context on every run.
## The part that is actually reusable
In production, the expensive prefix usually comes from a few places:
- system instructions that stay stable across many turns
- tool definitions and response schemas
- product policy or compliance text
- shared documentation or retrieval bundles
- few-shot examples that guide style or extraction
What does not belong in the cache is anything whose correctness depends on the current user, current permissions, current tenant, or current state of the world. If the answer changes when the user changes, the cache key needs that user boundary in it or the content should stay out of cache entirely.
That distinction matters because many teams discover caching only after they have built a giant prompt blob. The cache then hides the problem instead of fixing it. If the shared prefix keeps growing, caching is not your architecture. It is your painkiller.
## The failure modes are predictable
I keep seeing four repeat offenders:
| Failure mode | What it looks like in production | Why it hurts |
|---|---|---|
| Hashing the rendered prompt string | A whitespace edit or reordered section causes a miss | You turn formatting noise into cache churn |
| Caching user-specific context | A tenant or permission change serves stale context | You create a data exposure bug, not a latency win |
| Caching retrieval output without provenance | The model sees an old doc snapshot and still sounds confident | Freshness and auditability disappear together |
| Treating cache hit rate as the only metric | Finance loves the savings, but users do not feel the difference | You optimize the wrong layer |
The last one is common because hit rate is easy to count. It is also incomplete. A cache hit that returns the wrong scope is worse than a miss.
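One way to enforce that rule is to treat a hit with the wrong scope as a miss at read time. This is a minimal sketch with an assumed entry shape (a dict carrying `tenant_scope` and `prefix`); the helper name is illustrative, not a library API:

```python
def cached_prefix(cache: dict, key: str, scope: str):
    """Return the cached prefix only if its tenant scope matches.

    A hit with the wrong scope is evicted and reported as a miss,
    because serving it would be a data exposure bug, not a latency win.
    """
    entry = cache.get(key)
    if entry is None:
        return None
    if entry["tenant_scope"] != scope:
        # Wrong-scope hit: evict and miss rather than serve stale context.
        del cache[key]
        return None
    return entry["prefix"]
```

The eviction is deliberate: a wrong-scope entry is evidence the key is missing a boundary, so keeping it around only invites a repeat.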
## Version the things you control
The cache key should describe meaning, not formatting.
I want keys built from stable inputs such as:
```
model=gpt-5.1
policy_version=12
tool_schema_version=4
retrieval_bundle_version=2026-03-28
tenant_scope=acme-prod
locale=en-US
```
That shape gives you two useful properties:
- You can reason about invalidation without diffing the whole prompt.
- You can explain a miss or a stale hit in terms of a specific version change.
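A minimal sketch of that key shape. The field names mirror the list above; the canonical-JSON-plus-hash construction is one reasonable choice, not a vendor requirement:

```python
import hashlib
import json

def cache_key(**parts: str) -> str:
    """Build a deterministic cache key from versioned, meaning-level inputs.

    Sorting the parts makes the key independent of argument order, and
    hashing canonical JSON keeps formatting noise out of the key entirely.
    """
    canonical = json.dumps(dict(sorted(parts.items())), separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

key = cache_key(
    model="gpt-5.1",
    policy_version="12",
    tool_schema_version="4",
    retrieval_bundle_version="2026-03-28",
    tenant_scope="acme-prod",
    locale="en-US",
)
```

Bumping any single version produces a different key, which is exactly the invalidation story you want to be able to explain.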
Anthropic’s docs also note that cache hits do not count against rate limits, which makes observability more important, not less. If you do not log hit rate by route and tenant, you can accidentally push a hot path into a slower path while the global cache number still looks good.[1]
## Cache the compiled prefix, not the raw string
The best abstraction is usually a compiled context object:
- normalized system instructions
- tool schema bundle
- shared examples
- curated evidence or retrieval blocks
- metadata for source versions and permission scope
This is better than caching one giant string because it keeps the boundary visible. You can still render the model-specific prompt at the edge, but the reusable artifact stays structured and inspectable.
That also makes invalidation cleaner. If the tool schema changes, you invalidate the schema component. If policy changes, you invalidate the policy component. If a doc changes, you invalidate the evidence bundle. You are no longer guessing which whitespace change matters.
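A sketch of such a compiled object, with illustrative component and version names; the point is that each component carries its own version, so invalidation is per-part rather than per-prompt-string:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompiledContext:
    # Each component is (version, content), so staleness is decided
    # per component instead of by diffing one giant rendered string.
    system: tuple[str, str]        # (policy_version, text)
    tool_schemas: tuple[str, str]  # (tool_schema_version, json)
    evidence: tuple[str, str]      # (retrieval_bundle_version, text)

    def stale_components(self, current: dict[str, str]) -> list[str]:
        """Name exactly the components whose version no longer matches."""
        stale = []
        if self.system[0] != current["policy_version"]:
            stale.append("system")
        if self.tool_schemas[0] != current["tool_schema_version"]:
            stale.append("tool_schemas")
        if self.evidence[0] != current["retrieval_bundle_version"]:
            stale.append("evidence")
        return stale
```

Rendering the model-specific prompt from this object happens at the edge; the cached artifact stays structured and inspectable.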
## Measure the right things
The metrics I care about are the ones that connect cost to actual product behavior:
| Metric | Why it matters |
|---|---|
| Cache hit rate by route | Tells you whether the cache matches real traffic |
| Input tokens saved | Shows direct cost reduction |
| p50 and p95 latency saved | Proves users can feel it |
| Miss reason | Separates version churn from architecture churn |
| Error rate on hit vs miss | Catches correctness regressions |
| Cost per successful outcome | Keeps you from optimizing the wrong system |
If your dashboard does not split hit rate by tenant or request type, it can lie to you. A single global number can hide the fact that your highest-value workflow barely benefits.
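The split-by-route-and-tenant accounting is simple enough to sketch directly; `CacheMetrics` here is a hypothetical helper, not a library API:

```python
from collections import defaultdict

class CacheMetrics:
    """Track hit rate per (route, tenant) so a single global number
    cannot hide a high-value workflow that never hits."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, route: str, tenant: str, hit: bool) -> None:
        key = (route, tenant)
        self.total[key] += 1
        if hit:
            self.hits[key] += 1

    def hit_rate(self, route: str, tenant: str) -> float:
        key = (route, tenant)
        return self.hits[key] / self.total[key] if self.total[key] else 0.0

    def global_hit_rate(self) -> float:
        total = sum(self.total.values())
        return sum(self.hits.values()) / total if total else 0.0
```

The interesting dashboards compare `hit_rate` per route against `global_hit_rate`: when they diverge, the global number is lying to someone.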
## The rollout that does not burn you
The safe rollout is still the boring one:
1. Log the candidate cache key and the reusable prefix.
2. Run shadow measurements and estimate hit rate.
3. Enable read-through caching for one narrow route.
4. Compare latency, cost, and outcome quality against control traffic.
5. Expand only after invalidation behaves the way you expected.
If the dry run looks weak, stop. Caching is not a substitute for prompt architecture.
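The shadow-measurement step can be approximated offline from request logs before any cache exists. This sketch assumes `(timestamp_seconds, request)` pairs and the 5-minute default reuse window; `shadow_measure` is an illustrative helper:

```python
def shadow_measure(requests, build_key, ttl: float = 300.0) -> float:
    """Estimate hit rate without serving anything from a cache.

    Replays logged requests, computes the key each one would have used,
    and counts a hit whenever the same key reappears within the TTL
    window (300 s matches Anthropic's 5-minute default).
    """
    last_seen: dict[str, float] = {}
    hits = 0
    for ts, req in requests:
        key = build_key(req)
        if key in last_seen and ts - last_seen[key] <= ttl:
            hits += 1
        last_seen[key] = ts
    return hits / len(requests) if requests else 0.0
```

If this dry-run number is low, the reusable prefix is not actually being reused, and enabling the real cache would only add complexity.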
## When not to cache
Do not add prompt caching if:
- the prompt changes materially on every request
- the reusable prefix is tiny
- your permissions model is still fluid
- the source of truth is still moving
- you cannot tell which prefix blocks are safe to share
That is the part people skip because caching feels like a free win. In reality, it is an architectural choice with a security boundary attached.
## A useful heuristic
If you cannot freeze the reusable prefix for at least one normal reuse window, do not cache it yet.
That window is 5 minutes in Anthropic’s default cache and 1 hour if you opt into the longer TTL.[1] The exact duration matters less than the operational rule: if your prefix is not stable enough to survive normal reuse, the cache will either miss constantly or go stale in ways that are hard to notice.
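In Anthropic’s API the cache breakpoint is declared with a `cache_control` field on a content block, and the 1-hour window is the opt-in `"ttl": "1h"` variant. The field names below follow Anthropic’s prompt caching docs; the model id and policy text are placeholders:

```python
# Placeholder standing in for a long, stable system prompt. In a real
# request this is the frozen, versioned prefix, not per-user content.
LONG_STABLE_POLICY_TEXT = "You are a support assistant. Policy v12: ..."

# Shape of an Anthropic Messages API request body with an explicit
# cache breakpoint. "ephemeral" defaults to a 5-minute TTL; "ttl": "1h"
# opts into the longer window.
request_body = {
    "model": "claude-sonnet-4-5",  # illustrative model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LONG_STABLE_POLICY_TEXT,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}
```

Everything after the breakpoint (here, the per-user message) is billed as ordinary input, which is exactly where user-specific context belongs.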
Prompt caching is worth doing when it removes repeated work without making the system opaque. That is why the feature exists in the first place. The hard part is still the same one every production system has: deciding what is allowed to be reused, proving it is still valid, and knowing when to throw it away.