# I Read Through 300 Agent Failure Logs. The Problem Was Never the Prompt

> Author: Tony Lee
> Published: 2026-02-26
> URL: https://tonylee.im/zh-CN/blog/context-engineering-agent-skills-10k-github-stars/
> Reading time: 1 minute
> Language: zh-CN
> Tags: ai, ai-agents, context-engineering, open-source, multi-agent, evaluation

## Rollout Alternates

en: https://tonylee.im/en/blog/context-engineering-agent-skills-10k-github-stars/
ko: https://tonylee.im/ko/blog/context-engineering-agent-skills-10k-github-stars/
ja: https://tonylee.im/ja/blog/context-engineering-agent-skills-10k-github-stars/
zh-CN: https://tonylee.im/zh-CN/blog/context-engineering-agent-skills-10k-github-stars/
zh-TW: https://tonylee.im/zh-TW/blog/context-engineering-agent-skills-10k-github-stars/

## Description

An open-source collection of context engineering skills just passed 10k GitHub stars. After applying it to my own agent architecture, I finally understood why agents fail.

## Summary

"I Read Through 300 Agent Failure Logs. The Problem Was Never the Prompt" is part of Tony Lee's ongoing coverage of AI agents, developer tools, startup strategy, and AI industry shifts.

## Outline

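- Smaller context windows are more accurate
- Tool descriptions determine 80% of agent performance
- Multi-agent systems need architecture first, agents second
- Vector search alone cannot handle memory
- Evaluation is the most underrated agent capability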

## Content

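Three hundred agent failure logs. I spent two weeks going through them one by one and tagging each with its root cause. The result surprised me: prompt problems accounted for only about 12%. Everything else came down to context: it was polluted, it overflowed, or it was missing altogether. Swapping models didn't help, and neither did swapping tools. The pattern held every time.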

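I have been working deep in context engineering for a while, so when an open-source project called [Agent Skills for Context Engineering](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering) shot past 10,000 GitHub stars, I noticed right away. It is an MIT-licensed project by context engineer Muratcan Koylan, and it has been cited in a paper from an AI lab at Peking University. That last point is what got me to actually clone it and dig in.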

## Smaller context windows are more accurate

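I used to assume that the more tokens I packed into the context, the better. I was wrong. The first principle this skill set teaches is "information density, not information volume."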

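As the context grows, the model's grasp of material in the middle gets worse. This is the so-called U-shaped curve effect: the model handles the beginning and the end well but skims over whatever sits in between. I ran my own test: I filled the context to 128K tokens, then compressed the same information into 32K. The compressed version was more accurate.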

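Processing cost does not grow linearly with token count. It grows faster than that: in a standard transformer, self-attention compute scales roughly quadratically with sequence length. After I halved the context, response latency dropped by 40% to 60%. Even with prefix caching, long inputs remain expensive. In one sentence: what matters is how much useful information you can fit into a given token budget.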
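One way to act on "density over volume" is to rank candidate snippets by relevance per token and keep only what fits the budget. The sketch below is a minimal illustration, not the repo's method: the `Snippet` type, the relevance scores, and the 4-characters-per-token estimate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float  # hypothetical score in [0, 1], e.g. from a reranker


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def pack_context(snippets: list[Snippet], budget: int) -> list[Snippet]:
    """Greedily keep the snippets with the highest relevance per token."""
    ranked = sorted(
        snippets,
        key=lambda s: s.relevance / estimate_tokens(s.text),
        reverse=True,
    )
    chosen: list[Snippet] = []
    used = 0
    for snippet in ranked:
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget:  # skip anything that would overflow the budget
            chosen.append(snippet)
            used += cost
    return chosen
```

Greedy packing is not optimal in general, but it is easy to reason about and makes the budget an explicit parameter rather than an afterthought.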

## Tool descriptions determine 80% of agent performance

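No matter how polished the prompt, a sloppy tool description will make the agent pick the wrong tool. The skill set puts it well: "Tools are contracts written for the LLM to read, not for humans." When my team built our MCP server, we rewrote every tool description following this guide, and tool-selection failures dropped noticeably.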

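Every tool description needs to state when to use the tool and what it returns. When two tools overlap in function, a human gets confused, and an agent gets even more confused. One comprehensive tool usually works better than several narrow ones. Error messages should tell the agent what to do next, not just what went wrong.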
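To make those rules concrete, here is a rough sketch of a tool description and an error payload that follow them. The tool name `search_orders`, its fields, and the `next_step` key are hypothetical and chosen only for illustration; they are not taken from the repo or from any particular MCP server.

```python
import json

# Hypothetical tool spec: says when to use it, when not to, and what it returns.
SEARCH_ORDERS_TOOL = {
    "name": "search_orders",
    "description": (
        "Use when the user asks about a specific customer's past orders. "
        "Do not use for inventory or pricing questions. "
        "Returns a JSON list of orders with id, date, and product."
    ),
    "parameters": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
}


def search_orders(customer_id: str, db: dict[str, list[dict]]) -> str:
    """Look up orders in an in-memory stand-in for a real database."""
    if customer_id not in db:
        # The error names the problem AND the next action for the agent.
        return json.dumps({
            "error": f"No customer with id {customer_id!r}.",
            "next_step": "Ask the user to confirm the customer id, then retry.",
        })
    return json.dumps({"orders": db[customer_id]})
```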

## Multi-agent systems need architecture first, agents second

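Expecting multiple agents to start up and collaborate on their own is wishful thinking. The repo clearly defines three patterns: an orchestrator that dispatches sub-agents, a peer-to-peer pattern in which agents communicate as equal nodes, and a hierarchical delegation chain.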

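After trying all three in production, I found the orchestrator pattern the most predictable and the easiest to debug. Sub-agents pass their results through the file system. The peer-to-peer pattern does better on creative tasks but risks falling into infinite loops. For structured queries, shared files are more reliable than vector search. In practice, I found three agents to be the ceiling for stability.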
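The orchestrator pattern with file-based handoff can be sketched in a few lines. This is an assumption-laden illustration rather than the repo's implementation: sub-agents are plain callables standing in for LLM calls, the file layout is invented, and for brevity the orchestrator persists each result on the sub-agent's behalf before reading the files back.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# A sub-agent is any callable that takes a task and returns text (stand-in for an LLM call).
SubAgent = Callable[[str], str]

MAX_SUBAGENTS = 3  # the stability ceiling reported in the text above


def run_orchestrator(task: str, agents: dict[str, SubAgent], workdir: Path) -> dict[str, str]:
    if len(agents) > MAX_SUBAGENTS:
        raise ValueError(f"at most {MAX_SUBAGENTS} sub-agents are supported")
    # Phase 1: run each sub-agent and persist its result to its own file.
    for name, agent in agents.items():
        payload = {"task": task, "result": agent(task)}
        (workdir / f"{name}.json").write_text(json.dumps(payload), encoding="utf-8")
    # Phase 2: the orchestrator reads results from disk, not from shared context.
    return {
        name: json.loads((workdir / f"{name}.json").read_text(encoding="utf-8"))["result"]
        for name in agents
    }


with tempfile.TemporaryDirectory() as tmp:
    results = run_orchestrator(
        "summarize Q3",
        {
            "researcher": lambda task: f"notes on {task}",
            "writer": lambda task: f"draft for {task}",
        },
        Path(tmp),
    )
```

Because every handoff is a file, a failed run leaves behind an inspectable trail, which is a large part of why this pattern is easy to debug.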

## Vector search alone cannot handle memory

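Vector search easily finds facts like "customer X bought product Y on date Z," but it cannot answer "what else did customers who bought product Y buy?" Relational information gets lost in embeddings.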
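The second question is a join, which is exactly what a relational store is built for. As a small illustration using Python's built-in `sqlite3` module, with a made-up `purchases` table and sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("alice", "Y"), ("alice", "A"), ("bob", "Y"), ("bob", "B"), ("carol", "C")],
)


def also_bought(conn: sqlite3.Connection, product: str) -> list[str]:
    """Products bought by any customer who also bought `product`."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.product
        FROM purchases AS target
        JOIN purchases AS other ON other.customer = target.customer
        WHERE target.product = ? AND other.product != ?
        ORDER BY other.product
        """,
        (product, product),
    ).fetchall()
    return [row[0] for row in rows]


print(also_bought(conn, "Y"))  # ['A', 'B']
```

A nearest-neighbor lookup over embeddings of individual purchase records has no equivalent of the self-join on `customer`, which is why the relationship goes missing.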

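The skill set proposes a four-layer memory architecture: working memory inside the context window, short-term memory within a session, long-term memory across sessions, and permanent memory as an archive. Of the pieces I tested, the most practical is the "file system as memory" pattern: navigate context with `ls` and `grep` instead of embedding queries. Dumping tool execution results into a scratch file saves a large amount of context-window space.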
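A rough sketch of the scratch-file idea follows. The helper names `dump_tool_result` and `grep` are hypothetical: the first writes a tool's full output to disk and hands back only a one-line pointer for the context window, and the second is a plain-Python stand-in for running `grep` over the scratch directory.

```python
import tempfile
from pathlib import Path


def dump_tool_result(scratch_dir: Path, name: str, output: str) -> str:
    """Write full tool output to a file; return a short pointer for the context."""
    path = scratch_dir / f"{name}.txt"
    path.write_text(output, encoding="utf-8")
    return f"saved {len(output.splitlines())} lines to {path.name}"


def grep(scratch_dir: Path, needle: str) -> list[str]:
    """Return matching lines as 'file:line_number:text', like `grep -n`."""
    hits = []
    for path in sorted(scratch_dir.glob("*.txt")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if needle in line:
                hits.append(f"{path.name}:{lineno}:{line}")
    return hits


with tempfile.TemporaryDirectory() as tmp:
    scratch = Path(tmp)
    pointer = dump_tool_result(scratch, "build_log", "ok\nERROR: missing dep\nok")
    matches = grep(scratch, "ERROR")
```

The agent keeps only `pointer` in its context and pulls specific lines back in on demand, instead of carrying the whole log through every turn.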

## Evaluation is the most underrated agent capability

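I almost skipped this section, and it turned out to be the most valuable one. The repo includes a TypeScript evaluation framework that uses an LLM as the judge, and it can even generate scoring rubrics automatically.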

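What impressed me was how it handles position bias. When comparing two responses side by side, the framework evaluates them twice with the order swapped, which cancels out the tendency to score the first answer higher. It supports both direct scoring and pairwise comparison. Once I had an evaluation pipeline in place, I could finally measure whether a prompt change actually improved results instead of guessing.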
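The order-swapping idea is simple enough to sketch independently of the repo's TypeScript code. In the Python illustration below, `judge` is a hypothetical callable (in practice an LLM call) that sees two answers and returns `"first"` or `"second"`; a winner is declared only when both orderings agree, and disagreement is treated as a tie.

```python
from typing import Callable

# judge(question, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def compare_debiased(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # the verdict flipped with the order, so it is not trustworthy


def always_first(question: str, first: str, second: str) -> str:
    # A maximally position-biased judge, used here only to show the effect.
    return "first"


print(compare_debiased(always_first, "q", "answer one", "answer two"))  # tie
```

A judge that always favors the first slot can never produce a winner under this scheme, which is the point: pure position bias is neutralized instead of being counted as a preference.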

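One thing the repo does not solve: evaluation rubrics still need manual calibration. The auto-generated rubrics are a reasonable starting point, but I still had to adjust the scoring weights for my specific business domain before I could really trust my results.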
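To show what adjusting weights means in practice, here is a minimal weighted-rubric sketch. The criteria names, the 1-to-5 scale, and the weight values are invented for the example and are not the repo's rubric format.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores; weights need not sum to 1."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weight for name, weight in weights.items()) / total


# Hypothetical judge output on a 1-5 scale.
scores = {"accuracy": 4.0, "tone": 2.0}

default_weights = {"accuracy": 1.0, "tone": 1.0}  # e.g. an auto-generated starting point
domain_weights = {"accuracy": 3.0, "tone": 1.0}   # hand-tuned for an accuracy-critical domain

print(weighted_score(scores, default_weights))  # 3.0
print(weighted_score(scores, domain_weights))   # 3.5
```

The same raw judgments yield different overall scores once the weights reflect what the domain actually cares about, which is why calibration has to happen before the numbers are trusted.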

When your agent gets something wrong, check the context first, then blame the model. [The repo is here](https://github.com/muratcankoylan/Agent-Skills-for-Context-Engineering).

## Related URLs

- Author: https://tonylee.im/zh-CN/author/
- Publication: https://tonylee.im/zh-CN/blog/about/
- Related article: https://tonylee.im/zh-CN/blog/eight-hooks-that-guarantee-ai-agent-reliability/
- Related article: https://tonylee.im/zh-CN/blog/medvi-two-person-430m-ai-compressed-funnel/
- Related article: https://tonylee.im/zh-CN/blog/claude-code-layers-over-tools-2026/
