
I Dug Through 300 Agent Failure Logs. The Problem Was Never the Prompt.

An open-source context engineering skillset just crossed 10k GitHub stars. After applying it to my own agent stack, I finally understand why agents fail.

Three hundred agent failure logs. I went through them over two weeks, tagging each one by root cause. The breakdown surprised me: prompt issues accounted for maybe 12%. The rest? Context was either contaminated, overflowing, or missing entirely. Swapping models didn’t help. Swapping tools didn’t help. The pattern held every time.

I’ve been deep in context engineering for a while now, so when an open-source project called Agent Skills for Context Engineering showed up and quickly crossed 10,000 GitHub stars, I paid attention. It’s MIT-licensed, built by a context engineer named Muratcan Koylan, and cited in a Peking University AI lab paper. That last part is what made me actually clone it.

Smaller context windows are more accurate

I assumed stuffing more tokens into the context would always help. I was wrong. The first principle this skillset teaches is “information density, not information volume.”

As context grows longer, models lose track of what’s in the middle. This is the “lost in the middle” effect: accuracy follows a U-curve, with the model reading the beginning and end well but skimming over everything between. I tested this myself by filling context to 128K tokens, then compressing the same information down to 32K. The compressed version scored higher on accuracy.

Processing cost doesn’t scale linearly with token count; self-attention makes it climb quadratically. Cutting context in half shortened response latency by 40 to 60 percent. Even with prefix caching, long inputs remain expensive. The one-line summary: what matters is how much useful information you pack into a given token budget.
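A minimal sketch of the density-over-volume idea: fit conversation history into a fixed token budget by keeping recent turns verbatim and truncating older ones. The word-count token estimate and the first-sentence truncation are my simplifications, not the repo's method; a real system would use the model's tokenizer and an LLM summarizer.

```python
# Sketch: compress history into a token budget, oldest turns first.
# Token cost is approximated by whitespace word count (an assumption;
# swap in the model's real tokenizer in practice).

def compress_history(turns: list[str], budget: int) -> list[str]:
    """Keep newest turns whole; shrink older ones until near budget."""
    def cost(items: list[str]) -> int:
        return sum(len(t.split()) for t in items)

    kept = list(turns)
    i = 0
    # Walk oldest -> newest, replacing each turn with its first
    # sentence until the estimated cost fits (or everything is cut).
    while cost(kept) > budget and i < len(kept):
        kept[i] = kept[i].split(".")[0] + " [truncated]"
        i += 1
    return kept
```

The key design choice is asymmetry: recent turns carry the most task-relevant detail, so compression pressure lands on the old end first.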

Tool descriptions determine 80% of agent performance

You can write a perfect prompt, but if your tool descriptions are sloppy, the agent picks the wrong tool. This skillset frames it well: “Tools are contracts read by LLMs, not humans.” When my team built an MCP server, we rewrote our tool descriptions following this guide. Tool selection failures dropped noticeably.

Each tool description needs to specify when to use it and what it returns. When two tools overlap in function, humans get confused, and agents get confused worse. One comprehensive tool usually beats several narrow ones. And error messages need to tell the agent what to do next, not just what went wrong.
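Putting those four rules together, a tool spec in the “contract” style might look like this. The tool name, fields, and error codes below are illustrative assumptions, not taken from the repo; the point is that the description says when to use the tool, when not to, what it returns, and what to do after a failure.

```python
# Hypothetical tool contract: usage conditions, return shape, and
# actionable errors are all spelled out for the LLM reader.
search_orders = {
    "name": "search_orders",
    "description": (
        "Use when the user asks about a customer's past purchases. "
        "Do NOT use for refunds (use process_refund instead). "
        "Returns a JSON list of {order_id, product, date}."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "since": {"type": "string", "description": "ISO date, e.g. 2024-01-01"},
        },
        "required": ["customer_id"],
    },
    # Errors tell the agent what to do next, not just what went wrong.
    "errors": {
        "NOT_FOUND": "No orders for this customer_id; "
                     "verify the id with a customer lookup first.",
    },
}
```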

Multi-agent systems need architecture before agents

Spinning up multiple agents and expecting them to collaborate automatically is wishful thinking. The repo defines three patterns clearly: an orchestrator directing subordinate agents, a peer-to-peer model where agents communicate as equals, and a hierarchical delegation chain.

After trying all three in production, the orchestrator pattern was the most predictable and the easiest to debug. Subordinate agents passed results through the file system. The peer-to-peer model worked better for creative tasks but risked infinite loops. For structured queries, shared files beat vector search. In practice, I found three agents to be the stability ceiling.
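The orchestrator-plus-file-passing setup can be sketched in a few lines. The two “agents” here are plain functions standing in for LLM calls, and the file names are my own; the structural point is that subordinates never share a prompt, only files on disk.

```python
# Minimal orchestrator pattern: subordinates hand results back
# through the file system instead of a shared context window.
import json
import tempfile
from pathlib import Path

def research_agent(task: str, workdir: Path) -> Path:
    # Stand-in for an LLM research step; writes findings to disk.
    out = workdir / "research.json"
    out.write_text(json.dumps({"task": task, "findings": ["fact A", "fact B"]}))
    return out

def writer_agent(research_path: Path, workdir: Path) -> Path:
    # Reads the research file, never the researcher's context.
    findings = json.loads(research_path.read_text())["findings"]
    out = workdir / "draft.txt"
    out.write_text("Draft based on: " + ", ".join(findings))
    return out

def orchestrator(task: str) -> str:
    workdir = Path(tempfile.mkdtemp())
    research = research_agent(task, workdir)   # step 1: gather
    draft = writer_agent(research, workdir)    # step 2: consume via file
    return draft.read_text()
```

Because every hand-off is a file, debugging is just reading the work directory, which is a large part of why this pattern was the easiest to debug.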

Vector search alone can’t handle memory

Vector search finds “Customer X bought Product Y on Date Z” easily. It cannot answer “What else did customers who bought Product Y also buy?” Relational information gets lost in embeddings.

The skillset proposes a four-tier memory architecture: working memory inside the context window, short-term memory within a session, long-term memory across sessions, and permanent memory as archives. The file-system-as-memory pattern was the most practical one I tested. You navigate context with ls and grep instead of embedding queries. Dumping tool results into a scratchpad file saved significant context window space.
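The scratchpad half of that pattern is simple enough to sketch directly. This is my own minimal version, not the repo's code: tool output lands in a file, and only lines matching a query re-enter the context window, which is the grep analogue.

```python
# File-system-as-memory sketch: tool results go to a scratchpad
# file; the agent pulls back only matching lines, not the whole dump.
import tempfile
from pathlib import Path

class Scratchpad:
    def __init__(self) -> None:
        self.path = Path(tempfile.mkstemp(suffix=".log")[1])

    def append(self, tool: str, result: str) -> None:
        # Append-only log, one tool result per line.
        with self.path.open("a") as f:
            f.write(f"[{tool}] {result}\n")

    def grep(self, needle: str) -> list[str]:
        # Only matching lines are loaded back into context.
        return [ln for ln in self.path.read_text().splitlines() if needle in ln]
```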

Evaluation is the most underrated agent skill

This was the section I almost skipped, and it turned out to be the most valuable. The repo includes a TypeScript evaluation framework that uses LLMs as judges. It even auto-generates scoring rubrics.

What impressed me was the position-bias mitigation. When comparing two responses side by side, the framework evaluates twice with the order swapped. This counters the tendency to rate whichever answer appears first more favorably. It supports both direct scoring and pairwise comparison. Building an evaluation pipeline meant I could finally measure whether prompt changes actually improved performance instead of guessing.
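The swap-and-rejudge trick can be sketched independently of the repo's TypeScript framework. In this Python sketch, judge stands in for an LLM call that returns which position (0 or 1) it prefers; a winner is declared only when both orderings agree, and disagreement collapses to a tie.

```python
# Position-bias mitigation: judge the pair twice with the order
# swapped; only a verdict that survives both orders counts.

def debiased_compare(a: str, b: str, judge) -> str:
    first = judge(a, b)                 # pass 1: a shown first
    second = judge(b, a)                # pass 2: b shown first
    if first == 0 and second == 1:      # a preferred in both orders
        return "A"
    if first == 1 and second == 0:      # b preferred in both orders
        return "B"
    return "tie"                        # disagreement => likely position bias
```

A judge that always favors the first position produces "tie" every time, which is exactly the failure mode this construction is designed to surface.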

One thing the repo doesn’t solve: evaluation rubrics still need human calibration. The auto-generated rubrics gave reasonable starting points, but I had to adjust scoring weights for my specific domain before the results became trustworthy.

When your agent gets something wrong, check the context before you blame the model. The repo is here.
