Prompt Caching for LLM Apps: Where It Actually Pays Off

6 min read

Prompt caching is easy to explain and annoyingly easy to get wrong.

The vendor docs make the core shape explicit. Anthropic’s prompt caching reuses a prompt prefix that covers tools, system, and messages, in that order, with a 5 minute default TTL and an optional 1 hour TTL for slower reuse windows [1]. OpenAI’s pricing page now exposes cached input as a separate price column on supported models, which is the right signal: this is a first-class cost lever, not a hidden implementation detail [2].

That does not mean you should cache everything. It means you should identify the reusable prefix, freeze it with a versioned contract, and stop paying full price for the same context on every run.
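As a concrete illustration, here is a minimal sketch against Anthropic’s Messages API using the official Python SDK: the stable tool definition and system text carry cache_control breakpoints, and everything after the last breakpoint stays per-request. The model name, tool, and policy text are placeholders, not a prescription.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_POLICY_TEXT = "..."  # stands in for the long, versioned policy and instruction text

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    tools=[
        {
            "name": "lookup_order",  # hypothetical tool definition
            "description": "Look up an order by id.",
            "input_schema": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
            # Breakpoint: tool definitions up to here become part of the cached prefix.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    system=[
        {
            "type": "text",
            "text": STABLE_POLICY_TEXT,
            # Breakpoint: the stable system block joins the cached prefix too.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    # Everything after the last breakpoint is per-request and is never cached.
    messages=[{"role": "user", "content": "Where is order 1234?"}],
)
```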

The part that is actually reusable

In production, the expensive prefix usually comes from a few places:

  • system instructions that stay stable across many turns
  • tool definitions and response schemas
  • product policy or compliance text
  • shared documentation or retrieval bundles
  • few-shot examples that guide style or extraction

What does not belong in the cache is anything whose correctness depends on the current user, current permissions, current tenant, or current state of the world. If the answer changes when the user changes, the cache key needs that user boundary in it, or the content should stay out of the cache entirely.
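One way to keep that boundary explicit, sketched below with hypothetical names, is to tag every prompt block with whether it is request-scoped and only let the unscoped blocks into the shared prefix:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptBlock:
    name: str
    text: str
    # True if correctness depends on the current user, tenant, permissions,
    # or live state; such blocks never go into the shared cached prefix.
    request_scoped: bool

def split_for_caching(blocks: list[PromptBlock]) -> tuple[list[PromptBlock], list[PromptBlock]]:
    """Separate blocks into the reusable prefix and the per-request suffix."""
    prefix = [b for b in blocks if not b.request_scoped]
    suffix = [b for b in blocks if b.request_scoped]
    return prefix, suffix

blocks = [
    PromptBlock("system_instructions", "...", request_scoped=False),
    PromptBlock("tool_schemas", "...", request_scoped=False),
    PromptBlock("user_permissions", "...", request_scoped=True),
    PromptBlock("current_ticket", "...", request_scoped=True),
]
prefix, suffix = split_for_caching(blocks)
```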

That distinction matters because many teams discover caching only after they have built a giant prompt blob. The cache then hides the problem instead of fixing it. If the shared prefix keeps growing, caching is not your architecture. It is your painkiller.

The failure modes are predictable

I keep seeing four repeat offenders:

| Failure mode | What it looks like in production | Why it hurts |
| --- | --- | --- |
| Hashing the rendered prompt string | A whitespace edit or reordered section causes a miss | You turn formatting noise into cache churn |
| Caching user-specific context | A tenant or permission change serves stale context | You create a data exposure bug, not a latency win |
| Caching retrieval output without provenance | The model sees an old doc snapshot and still sounds confident | Freshness and auditability disappear together |
| Treating cache hit rate as the only metric | Finance loves the savings, but users do not feel the difference | You optimize the wrong layer |
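The first row is easy to demonstrate: if the key is a hash of the rendered string, formatting noise is indistinguishable from a real change. A toy example, assuming nothing beyond the standard library:

```python
import hashlib

def key_of(rendered_prompt: str) -> str:
    return hashlib.sha256(rendered_prompt.encode()).hexdigest()[:12]

a = "You are a support agent.\nFollow policy v12."
b = "You are a support agent.\n\nFollow policy v12."  # one extra blank line

print(key_of(a) == key_of(b))  # False: a whitespace edit becomes a cache miss
```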

The last one is common because hit rate is easy to count. It is also incomplete. A cache hit that returns the wrong scope is worse than a miss.

Version the things you control

The cache key should describe meaning, not formatting.

I want keys built from stable inputs such as:

model=gpt-5.1
policy_version=12
tool_schema_version=4
retrieval_bundle_version=2026-03-28
tenant_scope=acme-prod
locale=en-US

That shape gives you two useful properties:

  1. You can reason about invalidation without diffing the whole prompt.
  2. You can explain a miss or a stale hit in terms of a specific version change.
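A sketch of that key, built directly from the fields above rather than from the rendered prompt; the function name and separator are illustrative:

```python
def cache_key(
    *,
    model: str,
    policy_version: int,
    tool_schema_version: int,
    retrieval_bundle_version: str,
    tenant_scope: str,
    locale: str,
) -> str:
    """Deterministic key built from meaning-level versions, not formatting."""
    parts = (
        f"model={model}",
        f"policy_version={policy_version}",
        f"tool_schema_version={tool_schema_version}",
        f"retrieval_bundle_version={retrieval_bundle_version}",
        f"tenant_scope={tenant_scope}",
        f"locale={locale}",
    )
    return "|".join(parts)

key = cache_key(
    model="gpt-5.1",
    policy_version=12,
    tool_schema_version=4,
    retrieval_bundle_version="2026-03-28",
    tenant_scope="acme-prod",
    locale="en-US",
)
# A bump to policy_version explains exactly why the next request misses.
```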

Anthropic’s docs also note that cache hits do not count against rate limits, which makes observability more important, not less. If you do not log hit rate by route and tenant, you can accidentally push a hot path into a slower path while the global cache number still looks good [1].

Cache the compiled prefix, not the raw string

The best abstraction is usually a compiled context object:

  • normalized system instructions
  • tool schema bundle
  • shared examples
  • curated evidence or retrieval blocks
  • metadata for source versions and permission scope

This is better than caching one giant string because it keeps the boundary visible. You can still render the model-specific prompt at the edge, but the reusable artifact stays structured and inspectable.

That also makes invalidation cleaner. If the tool schema changes, you invalidate the schema component. If policy changes, you invalidate the policy component. If a doc changes, you invalidate the evidence bundle. You are no longer guessing which whitespace change matters.
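A minimal sketch of what that compiled object might look like, with hypothetical field names; the point is that each component carries its own version, so invalidation targets one component instead of the whole string:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CompiledPrefix:
    """The reusable artifact: structured components plus their versions and scope."""
    system_instructions: str
    tool_schemas: list[dict]
    shared_examples: list[str]
    evidence_blocks: list[str]
    versions: dict[str, str]   # e.g. {"policy": "12", "tool_schema": "4", "evidence": "2026-03-28"}
    permission_scope: str      # e.g. "acme-prod"

    def render_prompt(self) -> str:
        """Model-specific rendering happens at the edge; the artifact stays structured."""
        return "\n\n".join([self.system_instructions, *self.shared_examples, *self.evidence_blocks])

def bump_component(prefix: CompiledPrefix, component: str, new_version: str) -> CompiledPrefix:
    """Invalidate one component by bumping its version, instead of diffing whitespace."""
    return replace(prefix, versions={**prefix.versions, component: new_version})
```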

Measure the right things

The metrics I care about are the ones that connect cost to actual product behavior:

| Metric | Why it matters |
| --- | --- |
| Cache hit rate by route | Tells you whether the cache matches real traffic |
| Input tokens saved | Shows direct cost reduction |
| p50 and p95 latency saved | Proves users can feel it |
| Miss reason | Separates version churn from architecture churn |
| Error rate on hit vs miss | Catches correctness regressions |
| Cost per successful outcome | Keeps you from optimizing the wrong system |

If your dashboard does not split hit rate by tenant or request type, it can lie to you. A single global number can hide the fact that your highest-value workflow barely benefits.
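In practice that means one structured record per request, something like the sketch below; the field names are illustrative and the emitter is a stand-in for whatever metrics pipeline you already run:

```python
import json
import time

def log_cache_event(
    *,
    route: str,
    tenant: str,
    cache_hit: bool,
    miss_reason: str | None,
    input_tokens_saved: int,
    latency_ms: float,
    outcome_ok: bool,
) -> None:
    """Emit one record per request so hit rate can be split by route and tenant."""
    record = {
        "ts": time.time(),
        "route": route,
        "tenant": tenant,
        "cache_hit": cache_hit,
        "miss_reason": miss_reason,        # e.g. "policy_version_bump" vs "new_prefix"
        "input_tokens_saved": input_tokens_saved,
        "latency_ms": latency_ms,
        "outcome_ok": outcome_ok,          # lets you compare error rate on hit vs miss
    }
    print(json.dumps(record))              # stand-in for your metrics pipeline
```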

The rollout that does not burn you

The safe rollout is still the boring one:

  1. Log the candidate cache key and the reusable prefix.
  2. Run shadow measurements and estimate hit rate.
  3. Enable read-through caching for one narrow route.
  4. Compare latency, cost, and outcome quality against control traffic.
  5. Expand only after invalidation behaves the way you expected.

If the dry run looks weak, stop. Caching is not a substitute for prompt architecture.
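The dry run itself is cheap. Here is a rough sketch of the shadow estimate from step 2, assuming you log candidate keys with timestamps and use the 5 minute reuse window:

```python
def estimate_hit_rate(events: list[tuple[str, float]], ttl_seconds: float = 300.0) -> float:
    """events: (candidate_cache_key, unix_timestamp) pairs from shadow logging.

    Counts a would-be hit whenever the same key reappears within the TTL window.
    """
    last_seen: dict[str, float] = {}
    hits = 0
    for key, ts in sorted(events, key=lambda e: e[1]):
        prev = last_seen.get(key)
        if prev is not None and ts - prev <= ttl_seconds:
            hits += 1
        last_seen[key] = ts
    return hits / len(events) if events else 0.0
```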

When not to cache

Do not add prompt caching if:

  • the prompt changes materially on every request
  • the reusable prefix is tiny
  • your permissions model is still fluid
  • the source of truth is still moving
  • you cannot tell which prefix blocks are safe to share

That is the part people skip because caching feels like a free win. In reality, it is an architectural choice with a security boundary attached.

A useful heuristic

If you cannot freeze the reusable prefix for at least one normal reuse window, do not cache it yet.

That window is 5 minutes in Anthropic’s default cache and 1 hour if you opt into the longer TTL [1]. The exact duration matters less than the operational rule: if your prefix is not stable enough to survive normal reuse, the cache will either miss constantly or go stale in ways that are hard to notice.

Prompt caching is worth doing when it removes repeated work without making the system opaque. That is why the feature exists in the first place. The hard part is still the same one every production system has: deciding what is allowed to be reused, proving it is still valid, and knowing when to throw it away.

Footnotes

  1. Anthropic prompt caching docs

  2. OpenAI pricing page