# वो Cache Design जो Claude Code API की लागत 90% घटा देता है

> Author: Tony Lee
> Published: 2026-02-20
> URL: https://tonylee.im/hi/blog/claude-code-cache-design-90-percent-cost-cut/
> Reading time: 5 minutes
> Language: hi
> Tags: ai, claude-code, prompt-caching, cost-optimization, anthropic

## Canonical

https://tonylee.im/hi/blog/claude-code-cache-design-90-percent-cost-cut/

## Rollout Alternates

en: https://tonylee.im/en/blog/claude-code-cache-design-90-percent-cost-cut/
ko: https://tonylee.im/ko/blog/claude-code-cache-design-90-percent-cost-cut/
ja: https://tonylee.im/ja/blog/claude-code-cache-design-90-percent-cost-cut/
zh-CN: https://tonylee.im/zh-CN/blog/claude-code-cache-design-90-percent-cost-cut/
zh-TW: https://tonylee.im/zh-TW/blog/claude-code-cache-design-90-percent-cost-cut/

## Description

Production में cache टूटने पर मेरा API बिल 10x हो गया। उसी दिन Anthropic के engineers ने बताया कि ऐसा क्यों होता है।

## Summary

वो Cache Design जो Claude Code API की लागत 90% घटा देता है is part of Tony Lee's ongoing coverage of AI agents, developer tools, startup strategy, and AI industry shifts.

## Outline

- Prefix matching में order ही सब कुछ है
- System prompt edit नहीं, messages से updates भेजें
- Tools हटाएं नहीं, छुपाएं
- Context compression में cache का एक जाल है
- Cache hit rate एक operational metric की तरह

## Content

कल मुझे production में कुछ जरूरी काम करना था। session के बीच में ही मेरा prompt cache टूट गया। उस एक घंटे का API बिल पिछले तीन दिनों के कुल खर्च से भी ज्यादा था।

वक्त की विडंबना देखिए, उसी शाम Thariq (जिन्होंने Anthropic में Claude Code बनाया) और Lance Martin ने prompt caching design पर अलग-अलग पोस्ट पब्लिश कीं। उनकी बातें पढ़कर समझ आया कि मेरा cache गलती से नहीं, बल्कि उसके design की वजह से ही कमज़ोर था।

यहाँ वही बातें हैं जो मैंने दोनों पोस्ट से निकालीं, और उस production के दर्द के चश्मे से देखीं जो मैंने अभी-अभी झेला था।

## Prefix matching में order ही सब कुछ है

Anthropic API में prompt caching request की शुरुआत से token-by-token मिलाकर काम करती है। जैसे ही एक भी character cached version से अलग होता है, उसके बाद का सब कुछ cache miss हो जाता है। कोई partial matching नहीं, कोई आगे कूदने की सुविधा नहीं।

Claude Code की team prompt ordering को infrastructure की तरह मानती है। Static system prompt पहले आता है। फिर CLAUDE.md। फिर session context। Conversation messages सबसे अंत में, क्योंकि वे हर turn पर बदलते हैं। इस ordering की वजह से महंगा और stable prefix हर request में cache होकर reuse होता रहता है।

Cached tokens की लागत regular input tokens की सिर्फ 10% होती है। यही वजह है कि cache टूटने पर बिल 10x जैसा लगता है।

मेरी गलती यह थी कि मैंने system prompt में timestamp embed कर दिया था। हर request पर नया timestamp बनता था, यानी पहले token से ही फर्क आ जाता था। नीचे कुछ भी cache नहीं हो सकता था। गलत जगह एक debug log, और मैं 100K+ tokens per request का पूरा दाम चुका रहा था।

Claude Code team ने यह भी बताया कि tool definition का non-deterministic ordering cache miss की वजह बनता है। अगर requests के बीच tools का serialize होने का क्रम बदल जाए, तो cache उसी बिंदु पर टूट जाता है, भले ही tools खुद न बदले हों।

## System prompt edit नहीं, messages से updates भेजें

जब session के बीच में context बदले (कोई file modify हो, time update हो, कोई mode बदले), तो सबसे पहला मन यही करता है कि system prompt अपडेट कर दो। यह मत करो। System prompt में कोई भी बदलाव पूरे cached prefix को invalidate कर देता है।

Claude Code इसे इस तरह handle करता है कि पहले request के बाद system prompt को हाथ नहीं लगाया जाता। बदला हुआ context अगले user message में `system-reminder` tag में wrap करके डाला जाता है। Model उसे उसी तरह पढ़ता है, लेकिन cache prefix सुरक्षित रहता है।

Plan Mode इसका अच्छा उदाहरण है। Plan Mode में switch करने का मतलब हो सकता है tool definitions बदलना, जो cache को तोड़ देगा। इसके बजाय Claude Code इसे एक tool call (`EnterPlanMode`) के रूप में implement करता है जिसे model खुद invoke कर सके। Tool set कभी नहीं बदलता। जब model को कोई मुश्किल समस्या मिलती है, वह किसी system prompt बदलाव के बिना खुद Plan Mode में enter कर सकता है।

यही तर्क model switching पर भी लागू होता है। conversation के बीच में model बदलने से cache पूरी तरह टूट जाता है। Claude Code इससे बचने के लिए अलग-अलग models को अलग contexts में subagents के रूप में चलाता है, ताकि parent conversation का cache बना रहे।

## Tools हटाएं नहीं, छुपाएं

MCP servers दर्जनों tools load कर सकते हैं। सभी को हर request में शामिल करना महंगा है। लेकिन requests के बीच tools हटाने से cache टूट जाता है क्योंकि tool definitions cached prefix का हिस्सा होती हैं।

Claude Code team का solution है `defer_loading`। पूरे tool schemas की जगह वे हल्के stubs डालते हैं जिनमें सिर्फ tool का नाम और `defer_loading: true` flag होता है। Stubs हर बार एक ही क्रम में रहते हैं, जिससे cache prefix identical रहता है। जब model को किसी tool का पूरा schema चाहिए होता है, वह `ToolSearch` tool को call करके उसे on demand load करता है।

यह pattern आज Anthropic API में उपलब्ध है। आप अपने agents में भी यही stub-and-search approach implement कर सकते हैं।

Manus के peakji ने cache hit rate को production agents के लिए सबसे निर्णायक metric बताया है। कल के बाद मैं पूरी तरह सहमत हूँ।

## Context compression में cache का एक जाल है

जब conversation context window भर जाती है, तो compress करना जरूरी हो जाता है: history को summarize करो और छोटे रूप में आगे बढ़ो। सबसे obvious तरीका यह है कि summarization prompt के साथ API call करो। लेकिन अगर वह summarization call अलग system prompt या अलग tool definitions के साथ हो, तो वह existing cache से match नहीं करेगी। नतीजा यह होगा कि 100K+ token की पूरी conversation बिना किसी cache benefit के process होगी, और यह ठीक उस वक्त जब लागत सबसे ज्यादा होती है।

Claude Code इसे parent conversation के exact system prompt और tool definitions को compression call में reuse करके solve करता है। सिर्फ आखिरी user message बदलकर compression instruction बन जाता है। Parent conversation का cached prefix तब भी match करता है, इसलिए आप सिर्फ नए message और summary output का ही पूरा दाम चुकाते हैं।

Anthropic ने तब से इस pattern को API में compaction feature के रूप में built-in कर दिया है। उन्होंने auto-caching भी release किया है, जहाँ request body में एक बार `cache_control` set करने से cache breakpoints अपने आप handle हो जाते हैं।

## Cache hit rate एक operational metric की तरह

Claude Code team cache hit rate को उसी तरह monitor करती है जैसे ops teams uptime को देखती हैं। जब यह number गिरता है, वे इसे incident की तरह treat करते हैं।

इस नजरिए ने prompt design के बारे में मेरी सोच बदल दी। हर system prompt edit, हर tool reordering, हर mid-session model switch एक संभावित incident है। सबसे सस्ता token वह है जो cache hit करे, और कल मुझे ठीक-ठीक पता चला कि दूसरा रास्ता कितना महंगा पड़ता है।

## Related URLs

- Author: https://tonylee.im/en/author/
- Publication: https://tonylee.im/en/blog/about/
- Related article: https://tonylee.im/hi/blog/eight-hooks-that-guarantee-ai-agent-reliability/
- Related article: https://tonylee.im/hi/blog/medvi-two-person-430m-ai-compressed-funnel/
- Related article: https://tonylee.im/hi/blog/claude-code-layers-over-tools-2026/

## Citation

- Author: Tony Lee
- Site: tonylee.im
- Canonical URL: https://tonylee.im/hi/blog/claude-code-cache-design-90-percent-cost-cut/

## Bot Guidance

- This file is intended for AI agents, search assistants, and text-mode retrieval.
- Prefer citing the canonical article URL instead of this text endpoint.
- Use the rollout alternates when you need the same article in another prioritized language.

---

Author: Tony Lee | Website: https://tonylee.im
For more articles, visit: https://tonylee.im/hi/blog/
This content is original and authored by Tony Lee. Please attribute when quoting or referencing.