
My Agent Called a Failed API 5 Times—The Bug Wasn't in the Code

When an agent repeats the same failing API call, code review won't help. Traces are the new source code for debugging AI agents.

A bug hit production. My agent was repeating the same failing API call, five times in a row. I opened the code first, out of habit. The retry logic was sound. The control flow was normal. Not a single error in the logs.

The code had no answers. It wasn’t until I opened the trace that the cause became visible.

Agent code is an empty vessel

Open any agent’s source code and you’ll find a model specification, a list of tools, and a system prompt. That’s about it. Which tool to call when, what reasoning sequence to follow—none of that lives in the code.

Teams running LangGraph-based agents say the same thing repeatedly: “You can’t judge agent quality through code review.”

  • Same code, same input, different tool call patterns every time
  • Unlike a function like handleSubmit(), the branching logic simply doesn’t exist in the code
  • Testing GPT-5.2 with the same query 10 times yields roughly 40% consistency in tool call ordering
  • When errors occur, there's no code bug to point to, which makes the failure nearly impossible to reproduce

This is the fundamental shift. In traditional software, the code is the behavior. In agents, the code is just the scaffolding. The actual behavior emerges at runtime, shaped by the model’s reasoning over whatever context it receives.
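To make the point concrete, here is a hypothetical agent definition, a minimal sketch rather than any real framework's API. The `AgentSpec` class, the model name, and the tool names are all invented for illustration. What matters is what the file does not contain:

```python
from dataclasses import dataclass, field

# A hypothetical agent definition. Note what's here: a model name, a tool
# list, and a system prompt. Note what's absent: any branching logic that
# decides which tool runs when. That decision is made by the model at
# runtime, so it never appears in a code review.
@dataclass
class AgentSpec:
    model: str
    system_prompt: str
    tools: list[str] = field(default_factory=list)

agent = AgentSpec(
    model="gpt-4o",  # placeholder model name
    system_prompt="You are a billing support agent.",
    tools=["lookup_invoice", "issue_refund", "send_email"],
)

# There is nothing else to review. The behavior is not in this file.
```

Reviewing this tells you which tools the agent *can* call, but nothing about which ones it *will* call, in what order, or why.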

Traces are the new source code

A trace records every footstep the agent takes. What it reasoned at each step, which tool it called and why—all of it captured. The debugging, testing, and performance analysis we used to do through code now has to happen through traces.

When an agent sees an error message and repeats the same call anyway, that's not a code bug. It's a reasoning failure, and the only place it is visible is the trace.

  • Comparing traces before and after a prompt change reveals reasoning quality differences instantly
  • In LangSmith, loading a trace from a specific point into the playground works like setting a breakpoint
  • A single trace can show you the exact moment the agent’s reasoning went off track—something no amount of logging can replicate
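The retry bug from the opening is exactly this kind of trace-level pattern. A sketch of how you might detect it, assuming a simplified trace format (a flat list of step dicts, which is an invention here; real platforms like LangSmith expose richer run objects):

```python
from collections import Counter

# Hypothetical trace: each step records the tool called, its arguments,
# and whether the call succeeded.
trace = [
    {"tool": "get_order", "args": {"id": 42}, "ok": False},
    {"tool": "get_order", "args": {"id": 42}, "ok": False},
    {"tool": "get_order", "args": {"id": 42}, "ok": False},
]

def find_retry_loops(trace, threshold=3):
    """Flag identical failing calls repeated `threshold` or more times:
    the 'saw the error, called it again anyway' reasoning failure."""
    counts = Counter(
        (step["tool"], tuple(sorted(step["args"].items())))
        for step in trace
        if not step["ok"]
    )
    return [call for call, n in counts.items() if n >= threshold]

print(find_retry_loops(trace))
# -> [('get_order', (('id', 42),))]
```

No log line or code read would surface this, because each individual call looks normal; only the sequence across the trace reveals the loop.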

Think of it this way: traditional debugging is reading a recipe to find the mistake. Agent debugging is watching the kitchen footage to see where the chef went wrong. The recipe might be perfect. The execution is where things break.

Testing fundamentally changes

In traditional software, you test before deployment and you’re done. Agents are non-deterministic, so you have to keep evaluating in production.

Without a pipeline that collects traces, builds eval datasets, and catches quality degradation or drift, you simply cannot operate agents at scale.

Teams that have adopted trace-based evaluation have seen measurable improvements in task success rates. The pattern is consistent: traces reveal failure modes that no pre-deployment test suite could predict.

  • Build an automated eval pipeline that samples production traces weekly
  • Pre-deployment testing alone cannot guarantee quality for non-deterministic systems
  • Monitoring without traces is like only checking whether the server is running
  • An agent can be “working normally” while executing completely wrong tasks—only traces catch this
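The shape of such a pipeline can be sketched in a few lines. Everything here is an assumption for illustration: the trace format (a list of steps with an `ok` flag), the scoring rule, and the baseline threshold. In practice the scorer would be an LLM judge or labeled checks rather than a one-line heuristic:

```python
import random

def score_trace(trace) -> bool:
    # Assumed success rule: the trace's final step reports ok=True.
    return bool(trace) and trace[-1].get("ok", False)

def weekly_eval(traces, sample_size=100, baseline=0.90):
    """Sample production traces, score them, and flag drift below
    the baseline success rate."""
    sample = random.sample(traces, min(sample_size, len(traces)))
    success_rate = sum(score_trace(t) for t in sample) / len(sample)
    return success_rate, success_rate < baseline
```

The key property is that the eval dataset is drawn from production traces, so it tracks how the agent actually behaves, not how a pre-deployment suite imagined it would.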

Collaboration and product analytics happen on traces too

Code review happens on GitHub. Where does agent judgment review happen?

Observability platforms are taking that role. Teams are commenting on traces, sharing specific decision points, and reviewing agent reasoning the way they used to review pull requests. The collaboration model itself is shifting.

Product analytics follows the same pattern. When a metric says “30% of users are dissatisfied,” you can’t find the cause without opening traces. The agent might be completing tasks successfully by its own measure while completely missing what the user actually wanted.

  • Product analytics tools like Mixpanel and agent debugging tools are converging on traces as their shared substrate
  • Analyzing agent tool call patterns can reverse-engineer what features users actually need
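That last point reduces to simple aggregation over traces. A sketch, again assuming an invented trace format (a list of steps with a `tool` field per user session):

```python
from collections import Counter

# Hypothetical session traces: each is a list of tool-call steps.
sessions = [
    [{"tool": "search_docs"}, {"tool": "summarize"}],
    [{"tool": "search_docs"}, {"tool": "search_docs"}],
    [{"tool": "summarize"}],
]

# Tally which tools users' sessions actually invoke. A tool that
# dominates usage points at the capability users need most; one that
# never fires may be dead weight worth cutting.
tool_usage = Counter(step["tool"] for trace in sessions for step in trace)
print(tool_usage.most_common())
# -> [('search_docs', 3), ('summarize', 2)]
```

The same traces that power debugging double as the raw material for product decisions, which is why the tools are converging.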

The bottom line

In the agent era, code is the building blueprint and traces are the security camera footage. When something goes wrong in the building, you don’t unfold the blueprint first—you rewind the footage.

The teams getting agent quality right are the ones that shifted their center of gravity from code to traces. Not because code doesn’t matter, but because the interesting failures—the ones that cost you users and money—live in the runtime behavior that only traces capture.
