
The Cache Design That Cuts Claude Code API Costs by 90%

My API costs jumped 10x when the cache broke in production. The same day, Anthropic engineers explained exactly why.

Yesterday I had urgent production work. Mid-session, my prompt cache broke. The API bill for that single hour was higher than the previous three days combined.

The timing was almost comical. That same evening, Thariq (who built Claude Code at Anthropic) and Lance Martin from Anthropic both published posts about prompt caching design. Reading their explanations, I realized my cache had been fragile by design, not by accident.

Here’s what I pulled from both posts, filtered through the production pain I’d just experienced.

Prefix matching means order is everything

Prompt caching in the Anthropic API works by matching from the start of the request, token by token. The moment a single character differs from the cached version, everything after that point becomes a cache miss. No partial matching. No skipping ahead.

The Claude Code team treats prompt ordering as infrastructure. Static system prompt goes first. Then CLAUDE.md. Then session context. Conversation messages go last, because they change on every turn. This ordering means the expensive, stable prefix gets cached and reused across every request in a session.
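That ordering can be sketched as a request builder with one cache breakpoint after the last stable block. The `cache_control: {"type": "ephemeral"}` marker follows the documented Messages API; the model id, helper name, and block contents are illustrative:

```python
# Sketch: assemble the request so the stable content forms the cached prefix.
def build_request(static_system: str, claude_md: str,
                  session_context: str, messages: list) -> dict:
    """Order blocks from most stable to most volatile."""
    return {
        "model": "claude-sonnet-4-5",  # assumption: any current model id
        "max_tokens": 1024,
        "system": [
            # 1. Static system prompt: never changes within a session.
            {"type": "text", "text": static_system},
            # 2. CLAUDE.md: stable per project.
            {"type": "text", "text": claude_md},
            # 3. Session context: stable per session. Cache breakpoint here,
            #    so everything up to this point is reused on every request.
            {"type": "text", "text": session_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # 4. Conversation messages last: they change on every turn.
        "messages": messages,
    }
```

Two requests in the same session then share a byte-identical prefix; only the `messages` list differs.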

Cache reads cost 10% of regular input tokens (writing the prefix to the cache carries a 25% premium over base input, paid once). That gap explains why a broken cache feels like a 10x price hike.
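The arithmetic is worth spelling out. Assuming an illustrative price of $3 per million input tokens and the 10% cache-read rate:

```python
# Back-of-the-envelope: why a broken cache reads as a ~10x price hike.
INPUT_PRICE = 3.00 / 1_000_000        # $ per input token (illustrative price)
CACHE_READ_PRICE = INPUT_PRICE * 0.1  # cache reads bill at 10%

prefix_tokens = 100_000  # stable prefix reused across the session

cost_hit = prefix_tokens * CACHE_READ_PRICE   # prefix read from cache
cost_miss = prefix_tokens * INPUT_PRICE       # prefix reprocessed in full

print(f"hit:  ${cost_hit:.2f}")    # $0.03 per request
print(f"miss: ${cost_miss:.2f}")   # $0.30 per request
```

At 100K+ tokens per request, a session of cache misses compounds into exactly the kind of bill I got.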

My mistake was embedding a timestamp in the system prompt. Every request generated a new timestamp, which meant the very first tokens differed every time. Nothing downstream could cache. One debug log in the wrong place, and I was paying full price on 100K+ tokens per request.
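The fix is mechanical: keep the system prompt byte-identical and move any volatile value into the latest user message. A minimal before/after sketch (helper names are mine):

```python
from datetime import datetime, timezone

# The bug: a timestamp in the system prompt changes the very first tokens
# of every request, so nothing downstream can ever match the cache.
def broken_system_prompt() -> str:
    now = datetime.now(timezone.utc).isoformat()
    return f"Current time: {now}\nYou are a coding agent."

# The fix: the system prompt is a constant; volatile values ride along
# in the user turn instead.
STATIC_SYSTEM = "You are a coding agent."

def user_message_with_time(text: str) -> dict:
    now = datetime.now(timezone.utc).isoformat()
    return {"role": "user", "content": f"Current time: {now}\n{text}"}
```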

The Claude Code team also reported that non-deterministic tool definition ordering causes cache misses. If your tools serialize in a different order between requests, the cache breaks at that point even if the tools themselves haven’t changed.

Push updates through messages, not system prompt edits

When context changes mid-session (a file gets modified, the time updates, a mode switches), the instinct is to update the system prompt. Don’t. Every edit to the system prompt invalidates the entire cached prefix.

Claude Code handles this by leaving the system prompt untouched after the first request. Changed context goes into the next user message, wrapped in a system-reminder tag. The model reads it the same way, but the cache prefix stays intact.

Plan Mode is a good example. Switching to Plan Mode could mean swapping out tool definitions, which would break the cache. Instead, Claude Code implements it as a tool call (EnterPlanMode) that the model can invoke itself. The tool set never changes. When the model detects a hard problem, it can enter Plan Mode on its own without any system prompt modification.
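The idea reduces to: the mode switch lives in application state, not in the tool list. A minimal sketch of that shape, not Claude Code's implementation (the schema follows the Anthropic `tools` parameter; the description text and handler are mine):

```python
# An always-present tool: the tool set never changes, so the cached
# prefix never changes. Invoking it only flips internal state.
ENTER_PLAN_MODE = {
    "name": "EnterPlanMode",
    "description": "Switch to planning: analyze and plan, do not edit files.",
    "input_schema": {"type": "object", "properties": {}},
}

def handle_tool_call(name: str, state: dict) -> None:
    """Mode switches mutate state; the request shape stays identical."""
    if name == "EnterPlanMode":
        state["mode"] = "plan"
```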

The same logic applies to model switching. Changing models mid-conversation breaks the cache entirely. Claude Code avoids this by running different models as subagents in separate contexts, keeping the parent conversation’s cache intact.

Hide tools instead of removing them

MCP servers can load dozens of tools. Including all of them in every request is expensive. But removing tools between requests breaks the cache because tool definitions are part of the cached prefix.

The Claude Code team’s solution: defer_loading. Instead of including full tool schemas, they insert lightweight stubs containing only the tool name and a defer_loading: true flag. The stubs stay in the same order every time, keeping the cache prefix identical. When the model actually needs a tool’s full schema, it calls a ToolSearch tool to load it on demand.
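A hypothetical sketch of the stub side of that pattern. The `defer_loading` flag comes from the posts; the exact request shape is my assumption, not the documented API:

```python
def tool_stub(name: str) -> dict:
    """A lightweight placeholder: same bytes, same position, every request."""
    return {"name": name, "defer_loading": True}  # flag name per the posts

def stub_list(tool_names: list[str]) -> list[dict]:
    # Sorted, so the prefix never shifts even as MCP servers come and go.
    return [tool_stub(n) for n in sorted(tool_names)]
```

The full schemas arrive later through the ToolSearch call, outside the cached prefix, so dozens of MCP tools cost a handful of stub tokens per request instead of their full schemas.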

This pattern is available in the Anthropic API today. You can implement the same stub-and-search approach in your own agents.

Manus’s peakji called cache hit rate the single most decisive metric for production agents. After yesterday, I agree.

Context compression has a cache trap

When a conversation fills the context window, you need to compress: summarize the history and continue in a trimmed form. The obvious approach is to call the API with a summarization prompt. But if that summarization call uses a different system prompt or different tool definitions, it won’t match the existing cache. You end up processing the entire 100K+ token conversation without any cache benefit, right at the moment when costs are highest.

Claude Code solves this by reusing the parent conversation’s exact system prompt and tool definitions for the compression call. Only the final user message changes to a compression instruction. The cached prefix from the parent conversation still matches, so you only pay full price for the new message and the summary output.
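Structurally, the compression request is the parent request plus one turn. A sketch under that assumption (the instruction text and helper name are illustrative):

```python
# Build a compaction request that reuses the parent conversation's exact
# prefix: same model, same system blocks, same tools, same history.
COMPRESS_INSTRUCTION = ("Summarize the conversation so far, preserving "
                        "open tasks, decisions, and file state.")

def build_compaction_request(parent_request: dict) -> dict:
    """Only the trailing user message changes; the cached prefix still matches."""
    return {
        **parent_request,  # model, system, tools: byte-identical to the parent
        "messages": parent_request["messages"]
                    + [{"role": "user", "content": COMPRESS_INSTRUCTION}],
    }
```

Because every byte up to the final message matches the parent conversation, the 100K+ tokens of history bill at the cache-read rate; only the instruction and the summary output cost full price.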

Anthropic has since built this pattern into the API as a compaction feature. They also released auto-caching, where setting cache_control once in the request body handles cache breakpoints automatically.

Cache hit rate as an operational metric

The Claude Code team monitors cache hit rate the way ops teams monitor uptime. When the number drops, they treat it as an incident.

That framing changed how I think about prompt design. Every system prompt edit, every tool reordering, every mid-session model switch is a potential incident. The cheapest token is the one that hits cache, and yesterday I learned exactly how expensive the alternative is.
