FIELD GUIDE · LLM COST
Make LLMs cheaper in production.
Production LLM cost optimization means lowering what an application spends on large language model calls without lowering output quality. The levers I lean on are prompt and response caching, structured outputs, provider failover, token reduction, and evals to prove quality held. This is the running guide to each.
Where the money leaks.
Prompt & response caching
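Most production spend is the same context sent over and over: system prompts, tool definitions, retrieved documents. Cache the stable prefix and you pay full price once, then a fraction on every reuse for as long as the cache entry lives. Response caching goes a step further: when an identical request repeats, you return the stored answer and skip the call. The real work is deciding what is actually stable and ordering prompts so the cacheable part comes first.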
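The arithmetic behind that claim can be sketched in a few lines. This is a rough model, not any provider's billing formula: the function names, the `cached_fraction` parameter, and the numbers in the usage note are all assumptions for illustration. It also assumes every call after the first is a cache hit, which is a best case.

```python
def uncached_input_cost(prefix_tokens, suffix_tokens, calls, price_per_token):
    """Input cost when every call resends the full prompt at full price."""
    return calls * (prefix_tokens + suffix_tokens) * price_per_token


def cached_input_cost(prefix_tokens, suffix_tokens, calls, price_per_token,
                      cached_fraction):
    """Input cost when the stable prefix is cached after the first call.

    Assumes the first call pays full price for everything and every later
    call hits the cache, paying `cached_fraction` of the normal price for
    the prefix. Real caches expire, so treat this as a best case.
    """
    if calls <= 0:
        return 0.0
    first_call = (prefix_tokens + suffix_tokens) * price_per_token
    later_call = (prefix_tokens * cached_fraction + suffix_tokens) * price_per_token
    return first_call + (calls - 1) * later_call


def build_prompt(system_prompt, tool_definitions, documents, user_message):
    """Order the prompt so the stable, cacheable parts come first."""
    stable_prefix = "\n\n".join([system_prompt, tool_definitions, documents])
    return stable_prefix, user_message
```

With a made-up 1,000-token prefix, a 100-token suffix, 10 calls, a price of 1.0 per token, and a cached fraction of 0.1, the uncached cost is 11,000 units and the cached cost is 2,900. Swap in your own token counts and your provider's actual prices before drawing conclusions.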
Structured outputs
Free-form text you then parse is where reliability and cost both leak. Constrain the model to a schema and you drop the retries caused by malformed responses, return only the fields you need, and simplify the code downstream. Cheaper and more reliable tend to move together here.
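The schema constraint itself is a provider-side feature and is not shown here. What follows is a minimal sketch of the client half: parsing a response, checking it against a schema, and keeping only the fields you asked for. The `parse_structured` helper, the `TICKET_SCHEMA` fields, and the ticket-triage task are hypothetical, and the type check is deliberately simple.

```python
import json


def parse_structured(raw, schema):
    """Parse a model response and keep only the fields named in `schema`.

    `schema` maps field name -> expected Python type. Raises ValueError on
    malformed JSON, a missing field, or a wrong type, so the caller can
    decide whether a retry is worth paying for.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    result = {}
    for field, expected_type in schema.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
        result[field] = data[field]
    return result


# Hypothetical schema for a ticket-triage task.
TICKET_SCHEMA = {"category": str, "urgent": bool}
```

Counting how often this raises in production gives you the retry rate that a schema-constrained call is supposed to drive down.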
Provider failover
One provider is one point of failure. When an API degrades or rate-limits, requests should fail over to another (Anthropic to OpenAI, or the reverse) behind a single interface. Done well, the same setup lets you route each task to the cheapest model that clears your quality bar.
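A minimal sketch of the single-interface idea, assuming each provider is wrapped in a plain function that takes a prompt and either returns text or raises. `ProviderError`, `complete_with_failover`, and the two stand-in providers are hypothetical names; no real SDK is called and no network traffic is made.

```python
class ProviderError(Exception):
    """Raised by a provider wrapper on an outage, timeout, or rate limit."""


def complete_with_failover(prompt, providers):
    """Try each (name, call) pair in order and return the first success.

    `providers` is a list of (name, callable) pairs; each callable takes a
    prompt string and returns text or raises ProviderError.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-ins for real client wrappers; no network calls are made here.
def flaky_primary(prompt):
    raise ProviderError("rate limited")


def steady_secondary(prompt):
    return f"echo: {prompt}"
```

Because the order of the list is the routing policy, sorting it by price for a given task is one way to get cheapest-first routing from the same code.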
Token reduction
Every token in and out is billed. Trimming dead context, compressing history, and summarizing instead of replaying lowers the cost of every single call. The trap is trimming so hard that quality slips, which is why this only works next to measurement.
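One simple form of trimming is to keep only the most recent history that fits a budget. The sketch below assumes messages are plain strings and uses a whitespace word count as a crude stand-in for a real tokenizer; `count_tokens` and `trim_history` are hypothetical helpers, and actual token counts will differ by model.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(messages, budget, counter=count_tokens):
    """Keep the most recent messages whose combined size fits `budget`.

    Walks backwards from the newest message and stops at the first one
    that would overflow, so the kept messages are a contiguous recent
    tail in their original order.
    """
    kept = []
    used = 0
    for message in reversed(messages):
        cost = counter(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()
    return kept
```

Dropping old turns outright is the bluntest option. Replacing them with a summary keeps more signal per token, at the price of one extra call to produce the summary.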
Evals
You cannot cut cost safely without knowing whether quality held. Evals turn "it feels fine" into a number, so you can drop to a cheaper model, prune a prompt, or switch providers and prove the output did not get worse. They are the safety net under every other lever here.
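The smallest useful eval is a labelled set, a score, and a gate. This sketch assumes exact-match scoring, which only suits tasks with a single correct answer; `pass_rate`, `quality_held`, and the `tolerance` parameter are hypothetical names chosen for illustration.

```python
def pass_rate(outputs, expected):
    """Fraction of outputs that exactly match the expected answer."""
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must be the same length")
    if not expected:
        raise ValueError("eval set is empty")
    matches = sum(1 for got, want in zip(outputs, expected) if got == want)
    return matches / len(expected)


def quality_held(baseline_rate, candidate_rate, tolerance=0.0):
    """True if the candidate scores within `tolerance` of the baseline."""
    return candidate_rate >= baseline_rate - tolerance
```

Run the current setup and the cheaper candidate over the same examples, then ship the change only if `quality_held` returns true. Tasks with open-ended output need a different scorer, but the gate stays the same.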
Worked examples, with the data.
Questions I get asked.
- What is production LLM cost optimization?
- It is the practice of lowering what an application spends on large language model calls without lowering output quality. The main levers are prompt and response caching, structured outputs, provider failover, token reduction, and evaluations that confirm quality held.
- How much can prompt caching reduce LLM costs?
- It depends on how much of each request is repeated, stable context. The more your calls reuse the same system prompt, tool definitions, and retrieved documents, the more caching helps, because you pay full price for that prefix once and a fraction on every reuse. Measure your own cache-hit rate rather than trusting a headline number.
- Do structured outputs make LLM calls cheaper?
- Often yes, in two ways. Constraining the model to a schema removes retries caused by malformed responses, and it lets you return only the fields you need instead of long prose. Both cut tokens and downstream complexity.
- Why fail over between providers like Anthropic and OpenAI?
- Relying on one provider means a single outage or rate limit can take your feature down. Routing through an interface that can switch providers keeps the feature available, and as a side effect lets you send each task to the cheapest model that meets your quality bar.
- How do you cut LLM cost without hurting quality?
- By measuring quality with evals before and after each change. Evals turn quality into a number, so you can switch to a cheaper model, prune a prompt, or reduce tokens and prove the output did not regress. Without that signal, cost cuts are guesses.
Paying too much for an AI feature?
I help teams cut LLM costs without giving up quality. If that is on your plate, get in touch.