Lessons from Building Claude Code: Prompt Caching Is Everything

Anthropic published an engineering deep-dive from the Claude Code team explaining how prompt caching underpins every architectural decision in the tool. Written by Thariq Shihipar, a technical staff member on the Claude Code team, the post reveals that cache hit rates are treated as a critical infrastructure metric, equivalent to service uptime, and that nearly every Claude Code feature was shaped by the constraint of maintaining cache integrity. The article provides concrete, transferable design patterns for anyone building agentic systems on Claude: structure content in the right order, never change the tool set mid-session, and pass updated information via messages rather than editing the system prompt.

Why Prompt Caching Is the Foundation of Claude Code

Long-running agentic products like Claude Code are economically viable only because of prompt caching. Without it, the input token costs of a typical Opus 4.6 coding session would be prohibitive: a 100-turn session with compaction cycles could run $50–100 in input tokens alone. With an 84% cache hit rate, the same session costs $10–19. Claude Code Pro at $20/month is viable precisely because of this compression.

In a post published on April 30, 2026, Thariq Shihipar, a technical staff member on the Claude Code team, shares the principles the team has internalized after designing around this constraint for over a year. The core thesis: "you fundamentally have to design agents for prompt caching first; almost every feature touches on it somehow."

How Prompt Caching Works, and Why Ordering Is Architectural

Prompt caching is prefix-matched. The API caches everything from the start of the request up to each cache_control breakpoint. Any change anywhere in the cached prefix invalidates everything after it.

Claude Code's prompt layout is organized to maximize how many sessions share the same cached prefix:

  1. Static system prompt and tools – shared across all users globally; never modified mid-session
  2. Project context (CLAUDE.md files, project-specific tools) – shared across all sessions within the same project
  3. Session-specific data – unique to the current session
  4. Conversation messages – always last, unique per session

This ordering means every Claude Code user shares the same system prompt cache. Everyone working in the same project directory shares the CLAUDE.md cache. Only the live conversation thread is unique. The ordering is not a style choice: it is a structural constraint that determines how much of the input cost gets compressed.
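The layered ordering above can be sketched as a request builder. The cache_control breakpoints and the tools/system/messages shape follow the Anthropic Messages API; the model id, prompt strings, and function name are placeholders for illustration.

```python
STATIC_SYSTEM_PROMPT = "You are a coding agent..."  # placeholder; identical for all users

def build_request(tools: list, project_context: str, session_data: str,
                  messages: list) -> dict:
    """Order content from most-shared to least-shared so prefixes cache well."""
    return {
        "model": "claude-sonnet-4-5",   # placeholder model id
        "tools": tools,                 # 1. static tool definitions, never changed mid-session
        "system": [
            # 1. static system prompt, shared by every user globally
            {"type": "text", "text": STATIC_SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},
            # 2. project context (e.g. CLAUDE.md), shared within a project
            {"type": "text", "text": project_context,
             "cache_control": {"type": "ephemeral"}},
            # 3. session-specific data, unique to this session
            {"type": "text", "text": session_data},
        ],
        "messages": messages,           # 4. conversation, always last
    }

req = build_request([], "CLAUDE.md contents...", "cwd=/repo",
                    [{"role": "user", "content": "Fix the failing test"}])
```

Because the breakpoints sit after the shared layers, editing anything in layers 1–2 would invalidate every later byte, which is exactly why the following principles exist.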

Key Design Principles for Cache-Aware Agent Development

Pass Updated Information via Messages, Not System Prompt Edits

When information in the system prompt becomes stale, the instinctive fix is to update the prompt. But that breaks the cache for the entire session and all future sessions that shared the prefix. Claude Code's solution: pass updated information as a <system-reminder> tag in the next user message instead. The cached prefix stays intact; the updated context arrives through the conversation layer without invalidating anything.
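A minimal sketch of that pattern: the <system-reminder> tag name comes from the article, but the helper function and its signature are hypothetical.

```python
def with_reminder(user_text: str, stale_update: str = "") -> dict:
    """Attach updated context to the next user turn instead of editing the
    cached system prompt, so the cached prefix stays byte-identical."""
    content = user_text
    if stale_update:
        # Updated context rides along in the conversation layer.
        content = f"<system-reminder>{stale_update}</system-reminder>\n{user_text}"
    return {"role": "user", "content": content}

msg = with_reminder("Run the tests again",
                    "A file was modified outside this session.")
```

The system prompt never changes, so every prior cache breakpoint still matches; only the (always-unique) messages layer grows.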

Never Change the Tool Set Mid-Conversation

Tools are part of the cached prefix. Adding a tool, removing a tool, or modifying a tool's description mid-session invalidates the cache for everything after that point. This is described as one of the most common cache-breaking mistakes in agent development.

Claude Code's Plan Mode illustrates the correct approach: rather than swapping out the tool set when entering plan mode, all tools remain present and mode-switching is implemented as a tool itself. The cached prefix survives the mode transition.
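A sketch of that idea: the tool list is defined once and never mutated, and entering or leaving plan mode is just an application-level state flip triggered by a tool call. Tool names and schemas here are illustrative, not Claude Code's actual definitions.

```python
TOOLS = [  # defined once; never added to, removed from, or edited mid-session
    {"name": "read_file", "description": "Read a file",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "edit_file", "description": "Edit a file",
     "input_schema": {"type": "object",
                      "properties": {"path": {"type": "string"}},
                      "required": ["path"]}},
    {"name": "exit_plan_mode", "description": "Leave planning and start executing",
     "input_schema": {"type": "object", "properties": {}}},
]

MODE = {"planning": True}

def handle_tool_call(name: str, args: dict) -> str:
    # Mode changes flip app state; the tool set (and cached prefix) is untouched.
    if name == "exit_plan_mode":
        MODE["planning"] = False
        return "Plan approved; execution enabled."
    if MODE["planning"] and name != "read_file":
        return "Blocked: only read-only tools are allowed in plan mode."
    return f"ran {name}"
```

Restricting behavior in the handler rather than the tool list means the prefix containing TOOLS is identical before and after the mode transition.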

Keep the Prefix Lean with Deferred Tool Loading

Loading dozens of MCP tools at startup inflates the prefix size for every request, reducing the number of sessions that can share the same cache. Claude Code uses lightweight tool stubs with defer_loading: true – tools are registered with minimal descriptions until actually called. This keeps the prefix stable and compact across sessions.
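The stub-then-resolve shape might look like the following. Only the defer_loading flag comes from the article; the registry and helper functions are an illustrative sketch, not Claude Code's implementation.

```python
FULL_DESCRIPTIONS = {  # long descriptions and schemas, kept out of the cached prefix
    "jira_search": "Search issues by query, with filters, pagination, examples...",
    "jira_create": "Create an issue with project, type, fields, attachments...",
}

def stub(name: str) -> dict:
    """A minimal tool entry: keeps the prefix small and byte-stable."""
    return {"name": name, "description": "", "defer_loading": True}

def resolve(name: str) -> dict:
    """Expand to the full definition only when the tool is actually called."""
    return {"name": name, "description": FULL_DESCRIPTIONS[name]}

tools = [stub(n) for n in FULL_DESCRIPTIONS]  # what goes into every request
```

The full descriptions never enter the shared prefix, so adding a verbose MCP server to a project does not shrink the pool of sessions that can share a cache.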

Fork from the Cached Prefix During Compaction

When a conversation grows large enough to require summarization, naive compaction would start a fresh session with a new system prompt: a full cache miss that re-pays the complete input token cost. Claude Code's compaction strategy instead reuses the parent conversation's exact system prompt and tool definitions, deliberately forking at a point where the prefix is already cached. The summarized session inherits the cached prefix rather than starting cold.
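A minimal sketch of cache-aware compaction, assuming the request shape from earlier: the child session copies the parent's system and tools blocks byte-for-byte so the prefix still matches, and only the conversation layer is replaced by the summary. The function name and summary wrapper are illustrative.

```python
def compact(parent_request: dict, summary: str) -> dict:
    """Fork a compacted session that inherits the parent's cached prefix."""
    return {
        "model": parent_request["model"],
        "system": parent_request["system"],         # byte-identical -> cache hit
        "tools": parent_request.get("tools", []),   # unchanged tool set
        "messages": [                               # only this layer is new
            {"role": "user",
             "content": f"Summary of the conversation so far: {summary}"},
        ],
    }

parent = {"model": "claude-sonnet-4-5", "system": [{"type": "text", "text": "..."}],
          "tools": [], "messages": ["...many turns..."]}
child = compact(parent, "Refactored auth module; tests now pass.")
```

Because the expensive layers are reused unchanged, the only cost of compaction is the short summary message, not a cold re-read of the whole prefix.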

Treating Cache Hit Rate as Infrastructure

The post reveals that the Claude Code team runs automated alerts on prompt cache hit rates, treating degradation with the same urgency as a service outage. A cache hit rate drop triggers a SEV (severity incident) and is investigated immediately. The article frames cache hit rate not as a performance optimization but as a correctness metric, one that, if ignored, quietly makes the product economically unviable.
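A sketch of such a monitor. The usage fields (input_tokens, cache_creation_input_tokens, cache_read_input_tokens) are the ones the Anthropic API reports per response; the alert threshold is made up, since the post does not give a number.

```python
ALERT_THRESHOLD = 0.70  # illustrative threshold; not from the article

def cache_hit_rate(usages: list) -> float:
    """Fraction of input tokens served from cache across a window of requests."""
    read = sum(u["cache_read_input_tokens"] for u in usages)
    total = sum(u["input_tokens"] + u["cache_creation_input_tokens"]
                + u["cache_read_input_tokens"] for u in usages)
    return read / total if total else 0.0

def check(usages: list) -> str:
    """Page like an outage when the hit rate degrades below threshold."""
    rate = cache_hit_rate(usages)
    return "SEV: cache hit rate degraded" if rate < ALERT_THRESHOLD else "ok"
```

Tracking the rate over a sliding window of real responses turns "did a change silently break the cache?" into an alert instead of a surprise on the next invoice.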

Cost Math for Agent Developers

The article provides concrete numbers that are directly applicable to any team building on Claude:

  • Cached reads cost 10% of normal input pricing ($0.50/M tokens vs. $5/M on Sonnet 4.6)
  • At a 90% cache hit rate, what would cost $100 in input tokens costs approximately $19
  • Switching models mid-session (e.g., Opus to Haiku for cost savings) can paradoxically increase total cost: model-specific cache prefixes mean the switch triggers a cold start, negating the per-token savings
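The blended-cost arithmetic behind these bullets can be made explicit. Prices follow the figures above ($5/M uncached input, $0.50/M cached reads); cache-write surcharges are ignored to keep the sketch simple.

```python
UNCACHED = 5.00   # $ per million input tokens
CACHED = 0.50     # $ per million cached-read tokens (10% of normal)

def input_cost(million_tokens: float, hit_rate: float) -> float:
    """Blended input cost for a given cache hit rate."""
    cached = million_tokens * hit_rate
    uncached = million_tokens * (1 - hit_rate)
    return cached * CACHED + uncached * UNCACHED

# 20M input tokens would cost $100 uncached; at 90% hits it drops to ~$19:
# 18M cached * $0.50 = $9, plus 2M uncached * $5 = $10.
print(round(input_cost(20, 0.0)))   # 100
print(round(input_cost(20, 0.9)))   # 19
```

The same function also shows why a mid-session model switch can backfire: a cold start is a stretch of requests at hit_rate near zero, paying full freight until the new prefix is re-cached.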