February 26, 2026 · 5 min read

I Dug Through 300 Agent Failure Logs. The Problem Was Almost Never the Prompt.

An open-source context engineering skillset passed 10k GitHub stars. After applying it to my own agent stack, I finally understood why agents fail.

Three hundred agent failure logs. Over two weeks I went through all of them and tagged each one by root cause. The breakdown surprised me: the prompt was the problem in maybe 12% of cases. The rest? The context was either contaminated, overflowing, or missing entirely. Swapping models made no difference. Neither did swapping tools. The same pattern showed up every time.

I have been deep in context engineering for a while, so when an open-source project called Agent Skills for Context Engineering appeared and quickly passed 10,000 GitHub stars, I paid attention. It is MIT-licensed, built by a context engineer named Muratcan Koylan, and cited in a paper from Peking University's AI lab. That last detail is what made me clone it.

Smaller context windows are more accurate

I used to assume that the more tokens you cram into the context, the better. I was wrong. The first principle this skillset teaches is “information density, not information volume.”

As the context gets longer, the model starts losing track of what sits in the middle. This is the U-curve effect, often described as "lost in the middle": the model reads the beginning and the end well but tends to drop what lies between them. I tested this myself. I filled the context up to 128K tokens, then compressed the same information into 32K. The compressed version scored higher on accuracy.

Processing cost does not grow linearly with token count. It grows faster than that: standard self-attention scales roughly quadratically with sequence length. In my tests, halving the context cut response latency by 40 to 60 percent. Even with prefix caching, long inputs stay expensive. In one line: what matters is how much useful information you pack into a given token budget.
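One way to make "density over volume" concrete is to rank candidate snippets by usefulness per token and fill a fixed budget greedily. The sketch below is my own illustration, not code from the repo: the `Snippet` type, the `relevance` scores, and the token counts are all hypothetical placeholders you would replace with your own tokenizer and scoring.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    tokens: int       # assumed to be pre-computed by your tokenizer
    relevance: float  # hypothetical usefulness score in [0, 1]


def pack_context(snippets, budget):
    """Greedily keep the snippets with the highest relevance per token."""
    ranked = sorted(snippets, key=lambda s: s.relevance / s.tokens, reverse=True)
    chosen, used = [], 0
    for s in ranked:
        # Skip anything that would push us over the token budget.
        if used + s.tokens <= budget:
            chosen.append(s)
            used += s.tokens
    return chosen, used


snippets = [
    Snippet("full API changelog", tokens=900, relevance=0.3),
    Snippet("error trace from last run", tokens=200, relevance=0.9),
    Snippet("schema of the orders table", tokens=150, relevance=0.8),
    Snippet("team style guide", tokens=600, relevance=0.2),
]

chosen, used = pack_context(snippets, budget=500)
print([s.text for s in chosen], used)
```

Greedy packing is not optimal in general (this is a knapsack problem), but it is usually good enough to show where a bloated context is spending its tokens.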

Tool descriptions determine 80% of agent performance

You can write the perfect prompt, but if your tool descriptions are sloppy, the agent picks the wrong tool. The skillset frames this well: “Tools are contracts read by LLMs, not humans.” When my team built an MCP server, we rewrote our tool descriptions following this guide. Tool selection failures dropped noticeably.

Every tool description should specify when to use the tool and what it returns. When two tools overlap in function, even humans get confused, and agents get confused even more. One comprehensive tool usually beats several narrow ones. And error messages should tell the agent what to do next, not just what went wrong.
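A minimal sketch of what those rules can look like in practice. The tool name, the spec fields, and the `next_step` key are hypothetical; they are not taken from the repo or from any specific framework's schema.

```python
# Hypothetical tool spec: the description says when to use the tool,
# when not to, and what comes back.
SEARCH_ORDERS_TOOL = {
    "name": "search_orders",
    "description": (
        "Use this when the user asks about existing orders for a known customer. "
        "Do not use it for product catalog questions. "
        "Returns a list of orders, each with order_id, product, and date."
    ),
    "parameters": {"customer_id": "string, required, e.g. 'C-1042'"},
}

# Stand-in data store for the example.
ORDERS = {"C-1042": [{"order_id": "O-1", "product": "Y", "date": "2026-01-15"}]}


def search_orders(customer_id: str) -> dict:
    """Return orders, or an error that tells the agent what to do next."""
    if not customer_id:
        return {
            "ok": False,
            "error": "customer_id is empty.",
            "next_step": "Ask the user for the customer ID, then call search_orders again.",
        }
    if customer_id not in ORDERS:
        return {
            "ok": False,
            "error": f"No customer with ID {customer_id!r}.",
            "next_step": "Confirm the ID with the user; do not retry with the same value.",
        }
    return {"ok": True, "orders": ORDERS[customer_id]}


print(search_orders("C-9999")["next_step"])
```

The point of the `next_step` field is that the agent reads the error as an instruction rather than a dead end.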

Multi-agent systems need architecture before agents

Spinning up several agents and hoping they will collaborate on their own is wishful thinking. The repo clearly defines three patterns: an orchestrator that directs subordinate agents, a peer-to-peer model where agents communicate as equals, and a hierarchical delegation chain.

After trying all three in production, the orchestrator pattern was the most predictable and the easiest to debug. Subordinate agents passed their results through the file system. The peer-to-peer model was better for creative tasks, but it carried the risk of infinite loops. For structured queries, shared files beat vector search. In practice, three agents felt like the upper limit for stability.
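The orchestrator pattern with file-based handoff can be sketched roughly as follows. This is my own simplified illustration, not the repo's implementation: `research_agent` and `summary_agent` are hypothetical stand-ins for real LLM-backed agents, and the JSON-file layout is an assumption.

```python
import json
import tempfile
from pathlib import Path


def research_agent(task: str) -> dict:
    # Placeholder for a real subordinate agent (an LLM call in practice).
    return {"task": task, "findings": f"notes about {task}"}


def summary_agent(task: str) -> dict:
    # Second placeholder agent.
    return {"task": task, "summary": f"summary of {task}"}


def run_orchestrator(task: str, workdir: Path) -> dict:
    """Run each subordinate agent, persist its result, then merge from disk."""
    agents = {"research": research_agent, "summary": summary_agent}
    for name, agent in agents.items():
        result = agent(task)
        # Each agent's output lands in its own file, so it can be inspected later.
        (workdir / f"{name}.json").write_text(json.dumps(result))

    # The orchestrator reads results back from files, not from chat history.
    merged = {}
    for path in sorted(workdir.glob("*.json")):
        merged[path.stem] = json.loads(path.read_text())
    return merged


with tempfile.TemporaryDirectory() as tmp:
    merged = run_orchestrator("churn analysis", Path(tmp))
    print(sorted(merged))
```

Because every handoff is a file on disk, debugging a bad run means opening the file a specific agent wrote, which is a large part of why this pattern was the easiest to reason about.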

Vector search alone cannot handle memory

Vector search can easily find “Customer X bought Product Y on Date Z.” But it cannot answer “What else did customers who bought Product Y buy?” That question needs a join across records, and relational information gets lost in embeddings.
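To see why the second question is relational, here it is as a self-join using Python's built-in `sqlite3`. The `purchases` table and its rows are made-up sample data for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("alice", "Y", "2026-01-10"),
        ("alice", "tripod", "2026-01-12"),
        ("bob", "Y", "2026-01-11"),
        ("bob", "tripod", "2026-01-20"),
        ("bob", "lens cap", "2026-01-21"),
        ("carol", "lens cap", "2026-01-22"),  # never bought Y, so not counted
    ],
)


def also_bought(product: str):
    """Products bought by customers who also bought `product`, most common first."""
    return conn.execute(
        """
        SELECT other.product, COUNT(DISTINCT other.customer) AS buyers
        FROM purchases AS base
        JOIN purchases AS other
          ON other.customer = base.customer
         AND other.product != base.product
        WHERE base.product = ?
        GROUP BY other.product
        ORDER BY buyers DESC, other.product
        """,
        (product,),
    ).fetchall()


print(also_bought("Y"))  # [('tripod', 2), ('lens cap', 1)]
```

A nearest-neighbor lookup over embedded purchase records returns records similar to the query text; it has no step that groups by customer and counts, which is exactly what this query does.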

The skillset proposes a four-tier memory architecture: working memory inside the context window, short-term memory within a session, long-term memory across sessions, and permanent memory in the form of archives. The file-system-as-memory pattern was the most practical one I tested. You navigate context with ls and grep instead of embedding queries. Dumping tool results into a scratchpad file saved a lot of context window space.
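A rough sketch of the scratchpad idea: write the full tool output to a file, keep only a short pointer in the context, and search the file when a detail is needed. The helper names `stash_tool_result` and `grep` are hypothetical, and the `grep` here is a tiny Python stand-in for the shell command, not the repo's code.

```python
import tempfile
from pathlib import Path


def stash_tool_result(scratch_dir: Path, name: str, output: str,
                      preview_chars: int = 80) -> str:
    """Write the full output to disk and return a short pointer for the context."""
    path = scratch_dir / f"{name}.txt"
    path.write_text(output)
    preview = output[:preview_chars].replace("\n", " ")
    return f"[saved {len(output)} chars to {path.name}] {preview}..."


def grep(scratch_dir: Path, needle: str) -> list[str]:
    """Minimal grep: return 'file:line_no:line' for every matching line."""
    hits = []
    for path in sorted(scratch_dir.glob("*.txt")):
        for i, line in enumerate(path.read_text().splitlines(), start=1):
            if needle in line:
                hits.append(f"{path.name}:{i}:{line}")
    return hits


scratch = Path(tempfile.mkdtemp())

# Simulated large tool output: 500 healthy rows and one failure at the end.
log = "\n".join(f"row {i}: status=ok" for i in range(500)) + "\nrow 500: status=timeout"

pointer = stash_tool_result(scratch, "db_query", log)  # this goes into the context
hits = grep(scratch, "timeout")                        # this is fetched on demand

print(pointer)
print(hits)
```

The context window carries a pointer of around a hundred characters instead of several thousand characters of log, and the one line that matters is still one search away.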

Evaluation is the most underrated agent skill

This was the section I almost skipped, and it turned out to be the most valuable. The repo includes a TypeScript evaluation framework that uses LLMs as judges. It also auto-generates scoring rubrics.

What impressed me was the position-bias mitigation. When comparing two responses side by side, the framework evaluates twice with the order swapped. This counters the tendency to rate whichever response appears first more favorably. It supports both direct scoring and pairwise comparison. Having an evaluation pipeline meant I could now measure whether prompt changes improved performance instead of guessing.
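The order-swap idea can be sketched in a few lines. The repo's framework is in TypeScript; this Python version is my own illustration of the technique, not a port. The `Judge` signature is hypothetical, and treating a disagreement between the two passes as a tie is my assumption about how to reconcile them.

```python
from typing import Callable

# A judge takes (prompt, first_response, second_response)
# and returns "first" or "second". In practice this wraps an LLM call.
Judge = Callable[[str, str, str], str]


def compare_pairwise(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Ask the judge twice with the order swapped; accept only a consistent winner."""
    first_pass = judge(prompt, a, b)   # a is shown first
    second_pass = judge(prompt, b, a)  # b is shown first

    winner_1 = "a" if first_pass == "first" else "b"
    winner_2 = "b" if second_pass == "first" else "a"

    # If the verdict flips when the order flips, the judge was reacting to position.
    return winner_1 if winner_1 == winner_2 else "tie"


def biased_judge(prompt: str, first: str, second: str) -> str:
    # Toy judge that always prefers whatever it sees first.
    return "first"


def length_judge(prompt: str, first: str, second: str) -> str:
    # Toy judge with a real (if crude) preference: the longer answer.
    return "first" if len(first) >= len(second) else "second"


print(compare_pairwise(biased_judge, "q", "short", "a much longer answer"))
print(compare_pairwise(length_judge, "q", "short", "a much longer answer"))
```

A purely position-driven judge collapses to a tie, while a judge with a consistent preference returns the same winner in both orders.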

One thing the repo does not solve: evaluation rubrics still need human calibration. The auto-generated rubrics gave reasonable starting points, but I had to adjust the scoring weights for my specific domain before the results became trustworthy.

When your agent does something wrong, check the context before you blame the model. The repo is here.
