5 मार्च 2026 9 मिनट पढ़ने में अपडेट किया गया 12 मार्च 2026

Codex Compaction को Differently Solve कैसे करता है

मैंने reverse-engineer किया कि Codex, context overflow को Claude Code से अलग कैसे handle करता है। जवाब में है AES encryption, session handover patterns, और KV cache tricks।

अगर आपने Claude Code को कभी serious coding session के लिए use किया है, तो terminal में “Compacting conversation…” जरूर देखा होगा। उसके बाद कुछ अजीब-सा लगने लगता है। दस मिनट पहले जो बात हुई थी, model वो भूलने लगता है। Response latency बढ़ जाती है। जो function आप दोनों ने मिलकर refactor किया था, उसके बारे में पूछो तो जवाब ऐसा आता है जैसे पहली बार सुन रहा हो।

यह इसलिए होता है क्योंकि Claude Code का 200K token context window उतनी जल्दी भर जाता है जितना कोई सोचता नहीं। एक बड़ा refactoring session, कुछ file reads, verbose output वाले tool calls, और आप capacity पर आ जाते हैं। जब यह threshold hit होती है (roughly 75-92% window, हालांकि मैंने 65% पर भी trigger होते देखा है), Claude Code conversation को summarize करता है, original messages drop करता है, और सिर्फ summary के साथ continue करता है। जो information summary में नहीं आई, वो चली जाती है।

काफी समय से यह चर्चा थी कि OpenAI का Codex इसे अलग तरह से handle करता है। मैंने जितने भी public analyses मिले, उन सबको खंगाल डाला। सबसे दिलचस्प काम Kangwook Lee का था, जो Krafton के CAIO हैं। उन्होंने prompt injection का use करके actual pipeline को reverse-engineer किया।

Compaction में क्या खो जाता है और क्यों मायने रखता है

Core problem सीधी है। Summarization एक lossy compression है। जब Claude Code compact करता है, तो full conversation का background summarization run होता है, एक compaction block बनता है, और उसके पहले की सब चीज़ें discard हो जाती हैं। CLAUDE.md files बचती हैं क्योंकि वो disk से re-read होती हैं, लेकिन जो बात सिर्फ conversation में कही थी वो disappear हो जाती है।

Tool call results में यह सबसे ज्यादा नुकसान करता है। जब आप Claude Code से कोई file read करवाते हैं, तो पूरा file content context में आता है। Command run करवाते हैं, पूरा output context में। यही tool results conversation का सबसे information-dense हिस्सा होते हैं, और summarization में यही सबसे ज्यादा flatten होते हैं। 500-line file read होकर एक sentence बन जाता है: “configuration file read की और database settings note किए।” Specific values, edge cases, line numbers, सब गायब।

मैंने यह दर्जनों बार होते देखा है। Compaction के बाद पूछो “उस helper function का return type क्या था जो हमने देखा था?” और confident लेकिन गलत जवाब मिलता है। Model hallucinate नहीं कर रहा usual sense में। वो एक ऐसे summary से काम कर रहा है जिसमें वो information है ही नहीं।

9 या उससे ज्यादा compactions के बाद problem compound होती जाती है। हर summary पिछली summary को और compress करती है। Session के शुरुआत में जो decision rationale था, वो completely erode हो जाता है। 10 घंटे के session के बाद model को याद नहीं रहता कि approach A को B के ऊपर क्यों चुना था, चाहे उस पर बीस मिनट की trade-off discussion क्यों न हुई हो।

Codex का Encrypted Compaction Pipeline

Kangwook Lee का analysis काफी clever था। उन्होंने दो chained prompt injections use करके Codex के compaction system का internal behavior extract किया।

पहला injection compactor LLM को target करता था। जब Codex compaction trigger करता है, तो locally summarize नहीं करता। Conversation को OpenAI के servers पर एक separate LLM के पास भेजता है जो summary produce करता है। Lee के injection ने इस compactor को trick किया कि वो अपना system prompt भी summary output में शामिल कर ले। फिर server ने इस summary को (जिसमें अब leaked prompt था) AES-encrypt करके एक opaque blob की तरह return किया।

दूसरा injection decryption step को exploit करता था। Encrypted blob और एक crafted user message को Responses API में pass करके server ने blob decrypt किया और model का context assemble किया। चूंकि पहले injection ने compactor का system prompt summary के अंदर embed कर दिया था, decrypted context ने पूरे pipeline का तरीका reveal किया।

जो मिला वो यह था: जब आप Codex की compact() API call करते हैं, तो एक separate LLM conversation summarize करता है और result AES-encrypted होकर वापस आता है। अगली turn पर server यह blob decrypt करता है, एक handoff prompt prepend करता है (“यहाँ पिछली conversation का summary है”), और पूरी चीज़ model को feed करता है। Encryption key OpenAI के servers पर रहती है। Client को plaintext summary कभी नहीं दिखता।

Compaction prompt itself open-source Codex CLI के non-Codex models के compaction template से लगभग identical निकला। Prompt engineering में कोई secret sauce नहीं। Interesting हिस्सा architecture है: summaries का server-side encryption, server-side decryption और injection, और एक opaque blob जिसे client pass तो करता है लेकिन inspect या modify नहीं कर सकता।

Encrypt क्यों करते हैं? Lee के analysis ने इसका definitive जवाब नहीं दिया। एक theory है कि encrypted blob में सिर्फ text summary से ज्यादा कुछ है: tool call restoration data, internal state markers, या structured metadata जो OpenAI expose नहीं करना चाहता। दूसरी possibility यह है कि encrypted blobs users को summary के साथ tamper करने से रोकते हैं। मुझे दूसरी explanation ज्यादा likely लगती है, लेकिन कोई confirm नहीं।

OpenAI Responses API के through server-side compaction support भी करता है। compact_threshold value set करो, और जब token count उसे cross करे, server inline compaction run करता है। Compaction item response के अंदर stream होता है, और आप इसे subsequent requests में append करते हैं।

Claude Code का approach इससे contrast करे: compaction block human-readable है। आप उसे inspect कर सकते हैं, और instructions parameter या CLAUDE.md में custom compaction directives add करके behavior customize कर सकते हैं। ज्यादा transparent है, लेकिन fundamental information loss दोनों में same है।

Session Handover Pattern

Compaction mechanics से भी interesting problem यह है कि नया session start करते वक्त context कैसे बचाएं। यहाँ एक developer की automation देखी जिसने मेरी सोच बदल दी।

Pattern इस तरह काम करता है। Compaction trigger होने से ठीक पहले, एक pre-compact hook सभी write tools block करता है। यह model को partially-aware state में code changes करने से रोकता है, जो एक failure mode है जो मुझे कई बार hit हुई है: compaction mid-refactor fire होती है, model track खो देता है कि कौन-सी files पहले ही change की जा चुकी हैं, और conflicting edits लिख देता है।

Writes blocked होने के बाद, system JSONL session log से सिर्फ user messages और thinking blocks extract करता है। बाकी सब, tool calls, file contents, assistant responses, drop हो जाता है। इससे log का size original का roughly 2% रह जाता है।

फिर तीन sub-agents parallel में run करते हैं, हर एक original uncompressed JSONL logs में वो information ढूंढता है जो extraction में miss हो गई। ये gaps ढूंढते हैं: architectural decisions जो discuss तो हुई लेकिन user messages में नहीं आईं, error patterns जो सिर्फ tool output में दिखे, rejected approaches के rationale। ये agents अपने findings एक resume-prompt.md file में compile करते हैं जिसमें session summary, gap analysis results, और modified files की list होती है।

VS Code का file watcher नई resume-prompt.md detect करता है और एक fresh session खोलता है जो इसे initial context की तरह load करता है। New session शुरू होती है एक clear, complete picture के साथ कि पिछली session कहाँ छोड़ी थी।

Reported improvement था build efficiency में 10x। यह number independently verify करना मुश्किल है, लेकिन architecture sense बनाता है। एक increasingly degraded summary की जगह, आपको एक fresh context window मिलती है जिसमें curated, gap-checked handover document है।

मैंने खुद इसका एक simpler version implement करने की कोशिश की। Gap analysis step में ही value concentrate होती है। उसके बिना, आप वही कर रहे हैं जो compaction पहले से करती है, बस अलग format में। उसके साथ, आप actively वो information recover कर रहे हैं जो summarization ने खोई थी। मेरे version में तीन की जगह एक sub-agent है, और results raw compaction से noticeably बेहतर हैं लेकिन शायद full three-agent approach जितने thorough नहीं।

KV Cache: Hidden Cost Lever

इसमें एक performance dimension है जो ज्यादातर discussions में completely miss हो जाती है। KV cache (attention के दौरान compute होने वाले key-value pairs) को requests के बीच reuse किया जा सकता है जब prompt prefix identical हो। Same opening tokens share करने वाले दो requests उन tokens का recomputation skip करते हैं।

Numbers significant हैं। Stable vs. perturbed system prompts की controlled test में, stable prefixes ने 85% cache hit rate achieve किया, median time-to-first-token 953ms था। Perturbed prefixes: 0% cache hits, 2,727ms TTFT। Cost per request $0.033 से $0.009 पर आ गई। यह है 65% latency reduction और 71% cost reduction, सिर्फ prompt prefix consistent रखने से।

इसका session handover pattern पर direct implication है। अगर आपका resume-prompt.md हमेशा same structural prefix से start होता है (system prompt, handoff instructions, फिर variable content), तो fixed portion cache हो जाता है। New session की हर subsequent request उस cache से benefit लेती है। अगर आप prefix structure randomize करते हैं या शुरुआत में ही variable content inject करते हैं, तो हर request scratch से recompute होती है।

मैंने अपना session folder structure इसी insight के around design किया। Session-id-based archiving handover documents organized रखता है, और resume prompts के लिए fixed-prefix convention का मतलब है कि हर new session के पहले 40-50K tokens KV cache hit करते हैं। Session archives को QMD से pre-index करना (जिसे मैंने अलग से cover किया है) retrieval step को faster बनाता है जब sub-agents को historical sessions search करने होते हैं।

जो Actually मायने रखता है

असली takeaway यह नहीं है कि Codex का approach Claude Code से better है या worse। दोनों compaction में information lose करते हैं। दोनों long sessions में struggle करते हैं। Architectural difference (encrypted opaque blob vs. human-readable compaction block) अलग-अलग design philosophies को reflect करती है, लेकिन fundamental limitation same है: context windows finite हैं, और summarization lossy है।

जो मायने रखता है वह है कि आप उस limitation के आसपास क्या build करते हैं। Session handover pattern, gap analysis, JSONL-based retrieval, KV cache optimization: ये engineering solutions हैं उस problem के लिए जिसे model improvement पूरी तरह solve नहीं करेगा। 500K या 1M token context window problem को delay करती है, eliminate नहीं।

AI coding tools में असली bottleneck model intelligence नहीं है। Context management है। यह मैंने खुद देखा है: mediocre summary के साथ good retrieval, excellent summary के साथ no retrieval को हमेशा outperform करती है। ऐसे systems build करना जो भूली हुई information reliably retrieve करें, ऐसे systems से ज्यादा मायने रखता है जो ज्यादा accurately summarize करें।

Technical details sourced from Kangwook Lee’s analysis और public API documentation from OpenAI और Anthropic।

न्यूज़लेटर से जुड़ें

नवीनतम AI पर इनसाइट्स पाएँ।