The LLM Cost and Scaling Playbook: Cut Your Bill Without Killing Quality

If your LLM app is working, congrats. Your next problem is that it is about to get expensive.

The frustrating part is that most teams try to solve cost by arguing about models. That helps, but it is rarely the biggest lever. The biggest levers are usually boring: how many tokens you ship, how often you repeat them, and how many times you retry.

This is my playbook for scaling LLM usage without watching your margin evaporate. It is written for product-minded engineers who want to ship, measure, and iterate.

TL;DR (save this)

  • Track cost per successful outcome, not cost per request.
  • Route: use a cheap model by default, escalate only when needed.
  • Shrink: reduce prompt size with tight instructions, structured outputs, and context budgeting.
  • Reuse: apply prompt/context caching and application-level caching aggressively.
  • Shift: push non-urgent work into batch / async where your provider discounts it.
  • Stabilize: make retries rare with guardrails, timeouts, and rate-limit aware queues.

Step 0: Know what you are paying for

Most LLM bills are just a variant of:

cost = (input_tokens * input_price) + (output_tokens * output_price) + tool_fees + retries

You can debate input vs output pricing forever, but your fastest win usually comes from reducing the “retries” term and then shrinking input tokens.

Also: tool use is never free. If your agent does web search, file search, code execution, or uses a hosted retrieval tool, read the provider docs for how those are billed.
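The cost formula above can be sketched as a small estimator. This is a hedged sketch: the prices in the example are placeholders, not any provider's real rates, and it models a retry as a full re-send of the request.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1m: float, output_price_per_1m: float,
                 tool_fees: float = 0.0, retries: int = 0) -> float:
    """Cost of one logical request, counting each retry as a full re-send."""
    one_call = (input_tokens * input_price_per_1m / 1_000_000
                + output_tokens * output_price_per_1m / 1_000_000
                + tool_fees)
    return one_call * (1 + retries)

# Example: 8k input / 1k output at hypothetical $1 / $4 per 1M tokens, one retry.
cost = request_cost(8_000, 1_000, 1.0, 4.0, retries=1)  # -> 0.024
```

Notice that the single retry doubled the bill, which is why the "retries" term is usually the fastest win.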

Step 1: Instrument the four numbers that matter

Do this before you touch prompts. You cannot optimize what you cannot see.

Track these per endpoint, per tenant, and per model:

  1. Input tokens (p50, p90, max)
  2. Output tokens (p50, p90, max)
  3. Retries per successful response (and why)
  4. Wall time (including tool calls)

Then define a single metric you can rally around:

Cost per successful outcome

Examples:

  • “Cost per ticket resolved”
  • “Cost per report generated”
  • “Cost per meeting summary delivered”
  • “Cost per onboarding completed”

This prevents a classic failure mode: you cut token usage by 40%, but success rate drops and humans redo the work. Your cost per outcome goes up, not down.
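The failure mode is easy to see with numbers. The figures below are hypothetical, chosen only to illustrate how a 40% per-request saving can still raise cost per outcome when the success rate drops:

```python
def cost_per_outcome(spend_per_request: float, requests: int, successes: int) -> float:
    """Total spend divided by successful outcomes, not by requests."""
    if successes == 0:
        return float("inf")
    return spend_per_request * requests / successes

# Before: $0.010/request, 90% success rate.
before = cost_per_outcome(0.010, 1000, 900)   # ~$0.0111 per outcome
# After "optimizing": 40% cheaper per request, but success drops to 50%.
after = cost_per_outcome(0.006, 1000, 500)    # $0.0120 per outcome: worse
```

Per-request spend went down; the metric you should care about went up.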

Step 2: Start with model routing, not prompt micro-optimizations

If you are sending every request to your biggest model, you are donating money to the internet.

A simple router can cut spend fast:

  • Use a smaller model for: classification, extraction, formatting, short answers, search query generation, “is this safe,” “is this in scope,” and first-pass drafts.
  • Escalate only if: confidence is low, the user is paying, or the action is high stakes.

Routing options that work in practice:

  • Rules first, then model. Example: “If the user asks for code, route to coding model.” This is cheap and predictable.
  • Two-pass: small model decides whether escalation is needed, then routes.
  • Budget-based: cap per request, degrade gracefully when budget is exceeded.

This is also where you control quality perception: make the fast path feel instant, and reserve the slow, expensive path for the moments that matter.
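A rules-first router with a confidence escape hatch can be sketched in a few lines. The model names, task labels, and the 0.6 threshold below are placeholders, not recommendations:

```python
CHEAP, EXPENSIVE = "small-model", "big-model"  # placeholder model names

def route(task_type: str, confidence: float, high_stakes: bool) -> str:
    # Rule 1: bounded tasks stay on the cheap model unless the stakes are high.
    if task_type in {"classify", "extract", "format", "draft"} and not high_stakes:
        # Rule 2: escalate only when the small model reports low confidence.
        return EXPENSIVE if confidence < 0.6 else CHEAP
    # Everything open-ended or high stakes goes to the expensive path.
    return EXPENSIVE
```

The point is not this exact policy; it is that the policy is explicit, cheap to evaluate, and easy to log and tune.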

Step 3: Shrink tokens with context budgets (the easiest money you will ever save)

Most production prompts are bloated for three reasons:

  1. You keep appending conversation history forever.
  2. You pack in too many examples “just in case.”
  3. You send large documents repeatedly because retrieval is hard.

Fix it with a hard budget.

3.1 Set a context budget per request

Pick a number (for example 4k, 8k, 16k input tokens) and enforce it.

Use a simple prioritization order:

  1. System instructions and tool schema (keep stable)
  2. Last user message (always include)
  3. Critical memory (short, curated)
  4. Retrieval results (top-k, trimmed)
  5. Short rolling summary of older chat
  6. Raw chat history (last N turns only)

When you hit the budget, cut from the bottom. Do not negotiate with the token counter.
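The prioritization order above can be enforced mechanically. This is a minimal sketch; `count_tokens` is a crude stand-in for a real tokenizer, and sections are assumed to arrive already sorted highest priority first:

```python
def count_tokens(text: str) -> int:
    # Placeholder: swap in your tokenizer (e.g. a tiktoken encoding).
    return len(text.split())

def build_context(sections: list[str], budget: int) -> list[str]:
    """Keep sections in priority order; cut from the bottom once over budget."""
    kept, used = [], 0
    for section in sections:
        tokens = count_tokens(section)
        if used + tokens > budget:
            break  # drop this section and everything lower priority
        kept.append(section)
        used += tokens
    return kept
```

The `break` (rather than skipping just the offending section) keeps the behavior predictable: lower-priority content never jumps the queue because it happens to be short.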

3.2 Prefer structured outputs to reduce retries

Retries are stealthy cost multipliers.

If you need JSON, enforce JSON. If you need a schema, use one. The biggest cost win is fewer failed parses and fewer “try again” loops.
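Even with schema-enforced outputs, validate before you accept, so a retry happens at most once and is counted. A minimal sketch (the required-keys check is a stand-in for full schema validation):

```python
import json

def parse_or_none(raw: str, required_keys: set[str]):
    """Return parsed JSON if it is valid and complete, else None (retry signal)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not required_keys <= set(data):
        return None  # parsed, but missing fields: still not usable
    return data
```

A `None` here should increment your retries metric before triggering the single retry you allow.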

3.3 Stop sending full documents when a paragraph will do

If your retrieval layer is returning 12 chunks, you are paying to confuse the model.

Better patterns:

  • Retrieve fewer chunks, but make them higher quality.
  • Compress chunks (extract the relevant sentences) before the main call.
  • Use a small model to do “evidence selection” cheaply.

Step 4: Caching is your superpower (and it is not just HTTP caching)

Caching is the closest thing you get to a cheat code in LLM economics. It can cut cost, cut latency, and make your product feel more stable.

4.1 Application-level caching (you control it)

Cache anything that is deterministic enough:

  • Moderation results
  • Classification labels
  • Canonical tool results (for example, a customer record lookup)
  • Summaries of a document at a given revision
  • Embeddings for the same content hash

Key it on inputs that actually change: user id, org id, doc hash, tool version, prompt version.

This is the unglamorous way teams get a 30% bill reduction in a week.
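The key design is the whole game. A hedged in-memory sketch (in production this would be Redis or your KV store; `PROMPT_VERSION` and `cached_summary` are illustrative names):

```python
import hashlib

PROMPT_VERSION = "v3"          # bumping this on deploy invalidates old entries
_cache: dict[str, str] = {}    # stand-in for Redis / your KV store

def cache_key(org_id: str, doc_hash: str) -> str:
    raw = f"{org_id}:{doc_hash}:{PROMPT_VERSION}"
    return hashlib.sha256(raw.encode()).hexdigest()

def cached_summary(org_id: str, doc_hash: str, compute) -> str:
    """Run the expensive call only on a cache miss."""
    key = cache_key(org_id, doc_hash)
    if key not in _cache:
        _cache[key] = compute()  # the expensive LLM call goes here
    return _cache[key]
```

Because the prompt version is in the key, a deploy naturally invalidates stale entries instead of serving output from an old prompt.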

4.2 Provider prompt/context caching (you get discounts)

Several providers now discount repeated prefixes, but the rules differ.

| Provider | How you enable it | What gets cached | Retention / TTL (per docs) | Pricing notes |
| --- | --- | --- | --- | --- |
| OpenAI | Automatic for eligible prompts; optional prompt_cache_key | Prompt prefix caching (see docs) | Varies by implementation and configuration | Cached input tokens are discounted; see pricing |
| Anthropic | Explicit cache_control breakpoints | Prompt prefix up to your breakpoints | 5-minute and 1-hour cache durations | Discounted token pricing for cache reads and writes |
| Gemini | Explicit cached content objects | Cached content you create (plus implicit caching for some requests) | TTL you set (docs note 1-hour default) | Separate cached-content storage billing plus token pricing |

If you are using OpenAI, read their Prompt Caching guide. OpenAI documents prompt caching and discounted cached input tokens, and you can provide a prompt_cache_key for routing and monitoring. See OpenAI pricing for cached-token and batch discounts.

If you are using Anthropic, prompt caching is explicit via cache_control breakpoints and has specific cache lifetimes. See Anthropic prompt caching and Anthropic pricing.

If you are using Google Gemini, context caching supports explicit caches with TTL and separate storage pricing. See Gemini context caching and Gemini pricing.

What to do in real life:

  • Put stable parts first: system instructions, tool definitions, schemas, examples.
  • Put volatile parts last: user input, retrieved snippets, per-user details.
  • Track cache hit rates and cached tokens in your logs.
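The ordering rule above is mechanical enough to encode. A provider-agnostic sketch (the field names follow the common chat-messages shape, but check your SDK's exact format):

```python
def build_messages(system: str, tool_schemas: str, examples: str,
                   retrieved: str, user_input: str) -> list[dict]:
    """Stable prefix first, volatile content last, to maximize prefix-cache hits."""
    stable_prefix = "\n\n".join([system, tool_schemas, examples])
    return [
        # Identical across requests -> eligible for provider prefix caching.
        {"role": "system", "content": stable_prefix},
        # Changes every request -> placed last so it never breaks the prefix.
        {"role": "user", "content": retrieved + "\n\n" + user_input},
    ]
```

If any piece of the "stable" prefix varies per request (timestamps are a classic offender), your hit rate quietly collapses, so keep that assembly byte-for-byte deterministic.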

4.3 Which caching strategy is best? A practical comparison

There is no single “best caching.” There is a best order of operations.

Here are the caching strategies that actually move your bill, what they are good for, and where teams get burned.

| Strategy | Best for | Biggest win | Biggest risk |
| --- | --- | --- | --- |
| Application-level cache (your DB/Redis/KV) | Deterministic steps and repeated tool calls | Huge cost win and predictable | Stale data if you do not version keys |
| Provider prompt caching (prefix caching) | Long, stable prefixes: system prompt, schemas, tool defs | Discounts on repeated input tokens | Low hit rate if your prefix changes even slightly |
| Output cache (response cache) | Idempotent “generate X for Y” endpoints | Instant repeats, great UX | Serving the wrong answer across users/tenants |
| Retrieval cache | Same queries repeated frequently | Less embedding/search overhead | Cache pollution if you do not scope by user/org |
| Embedding cache | Re-embedding the same content | Removes a silent recurring cost | Missed invalidation when content changes |

Strategy 1: Application-level caching (usually the best first move)

This is the best default because it is provider-agnostic and you can reason about it.

Cache things like:

  • Policy decisions: “Can this user do this?”
  • Moderation results
  • Routing decisions: “Does this need the expensive model?”
  • Tool calls: “Fetch account details”
  • Summaries keyed by document revision

Make the cache key include the things that make an output safe to reuse:

  • Tenant or org id
  • User id when needed
  • Prompt or tool version (so a deploy can invalidate old values)
  • Input hash (and doc hash for document-derived outputs)

If you do only one thing, do this.

Strategy 2: Provider prompt caching (high leverage when you have long prompts)

Provider caching is great when you have a large stable prefix and you call it a lot.

In practice, that means:

  • Your system prompt is long (policies, style, output schema).
  • Your tool schemas are non-trivial.
  • You have “few-shot” examples.

How to make it work:

  • Keep your prefix stable. One extra space or a different tool ordering can drop your hit rate.
  • Put user-specific and retrieval content at the end.
  • Log cached-token counts and cache hit indicators when the provider exposes them.

Provider differences to know:

  • OpenAI: caching is automatic when eligible, and the docs describe prompt_cache_key. If your traffic is bursty, prompt caching can reduce cost spikes because repeats become cheaper. See OpenAI’s Prompt Caching guide and pricing.
  • Anthropic: caching is explicit via cache_control breakpoints with documented cache lifetimes. See Anthropic prompt caching.
  • Gemini: caching supports explicit caches with TTL and separate storage pricing, and Gemini also documents implicit caching for some requests. See Gemini context caching and pricing.

Strategy 3: Output caching (powerful, but do not overuse it)

Output caching is only safe when the answer is not personalized and not time-sensitive, or when you scope it tightly.

Good candidates:

  • “Summarize this document version”
  • “Extract entities from this text”
  • “Generate a title for this article draft”

Bad candidates:

  • Anything that depends on user identity or private org context (unless your cache key includes those).
  • Anything that depends on current data (unless you set a short TTL and accept staleness).

When output caching is safe, it feels magical. When it is not safe, it is a security incident.
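The two safety levers are tenant scoping and TTL, and both belong in the cache itself. A minimal in-process sketch (production would use Redis with `EXPIRE` or similar; `time.monotonic` stands in for a real clock):

```python
import time

class OutputCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict[tuple, tuple[float, str]] = {}

    def get(self, tenant_id: str, input_hash: str):
        # Tenant id is part of the key: cross-tenant reuse is impossible by construction.
        hit = self._store.get((tenant_id, input_hash))
        if hit and time.monotonic() - hit[0] < self.ttl:
            return hit[1]
        return None  # miss or expired

    def put(self, tenant_id: str, input_hash: str, value: str) -> None:
        self._store[(tenant_id, input_hash)] = (time.monotonic(), value)
```

Making the tenant id a mandatory argument, rather than an optional prefix convention, is what turns "security incident" into "cache miss."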

4.4 The simple rule for choosing caching

Use this order:

  1. Application-level caching for deterministic work and tool calls.
  2. Provider prompt caching when you have long, stable prefixes.
  3. Output caching only for tightly scoped, low-risk endpoints.

Do not chase provider caching until you know your prompts are stable and you have instrumentation for hit rate.

Step 5: Batch the work you do not need right now

If you can wait minutes or hours, stop paying real-time prices.

Both OpenAI and Anthropic document batch processing that discounts input and output tokens, and Google Gemini pricing includes batch tiers for several models. Read your provider’s batch docs carefully, including queue limits and completion windows: OpenAI Batch API, Anthropic pricing, and Gemini pricing.

Common batch workloads:

  • Backfilling embeddings
  • Rewriting old content
  • Large-scale classification
  • Offline evaluations
  • Summarizing long archives

The win is not only cost. You also avoid hammering synchronous rate limits.
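For reference, OpenAI's Batch API takes a JSONL file of requests. A sketch of building that payload, following the request shape in their docs (`custom_id` / `method` / `url` / `body`); verify the current format before relying on it:

```python
import json

def batch_lines(texts: list[str], model: str) -> list[str]:
    """One JSONL line per request, ready to upload as a batch input file."""
    lines = []
    for i, text in enumerate(texts):
        request = {
            "custom_id": f"task-{i}",  # ties each result back to your record
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": f"Classify: {text}"}],
            },
        }
        lines.append(json.dumps(request))
    return lines
```

Anthropic and Gemini have their own batch request shapes, so treat this as the pattern (id, endpoint, body per line), not a universal format.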

Step 6: Design for rate limits and throughput from day one

Rate limits are not a “later” problem. Rate limits are why your p90 is fine and your p99 is on fire.

For OpenAI, the rate limits guide is worth reading end to end, including the headers that tell you what is left and when it resets.

Scaling patterns that work:

  • Put LLM calls behind a queue with per-model concurrency.
  • Use jittered exponential backoff for retryable failures.
  • Separate interactive traffic from bulk jobs (and use batch where possible).
  • Add per-tenant budgets and circuit breakers.

If your traffic ramps hard (launch day, social spike), you want the system to degrade gracefully instead of melting.
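The backoff pattern above is small enough to show in full. This is the "full jitter" variant: each retry waits a uniformly random delay up to an exponentially growing cap, which spreads retries out instead of synchronizing them into thundering herds.

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to sleep before retry number `attempt` (0-indexed)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Pair it with a retry budget (give up after N attempts) and only apply it to failures the provider marks retryable, such as 429s and transient 5xxs.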

Step 7: A realistic “cut the bill in half” checklist

This is the sequence that tends to work:

  1. Kill retries: fix JSON parsing, add guardrails, reduce tool flakiness.
  2. Route: small model first, escalate on demand.
  3. Budget: enforce a context limit and trim retrieval.
  4. Cache: app-level caching, then provider caching tuning.
  5. Batch: move offline workloads off the hot path.
  6. Audit: find tenants or endpoints with runaway tokens.

If you do this in order, you usually get meaningful savings without making the product feel worse.

A quick, honest note on “cheaper models”

Cheaper models are great when the task is bounded and you can validate outputs. They are risky when the task is open-ended and the user notices failure.

The best cost optimization is not “use a worse model.” It is “use the right model at the right time, and stop wasting tokens.”

Sources (first-party docs)