
7-Step Pipeline to Verify Code Written by AI Agents

When agents push 3,000 commits a day, humans can't review them all. Here's how to build a machine-verified pipeline that catches what people can't.

This is the hottest topic right now. Agents are churning out hundreds of commits a day, and no one can review them all.

Peter, a developer at OpenClaw, sometimes pushes over 3,000 commits in a single day. That volume is far beyond what any human reviewer can process alone.

At first, I thought there was no solution. Then I read Ryan Carson’s “Code Factory” and the picture clicked. Instead of trying to read everything, you build a structure where machines verify the code.

Define Merge Rules in a Single JSON File

Write down which paths are high-risk and which checks must pass, all in one file. The key insight is that this prevents documentation and scripts from drifting apart.

  • High-risk paths require a Review Agent plus browser-based evidence
  • Low-risk paths can merge after passing a policy gate and CI alone
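As a minimal sketch of what such a single-file policy might look like, here is a hypothetical JSON rules file loaded and applied in Python. The path patterns and check names are illustrative assumptions, not Carson's actual schema:

```python
import json
from fnmatch import fnmatch

# Hypothetical merge-rules file; path patterns and check IDs are illustrative.
MERGE_RULES = json.loads("""
{
  "high_risk_paths": ["src/payments/**", "infra/**"],
  "required_checks": {
    "high": ["risk-policy-gate", "review-agent", "browser-evidence", "ci"],
    "low": ["risk-policy-gate", "ci"]
  }
}
""")

def risk_level(changed_paths):
    """Classify a PR as high or low risk from its changed file paths."""
    for path in changed_paths:
        if any(fnmatch(path, pat) for pat in MERGE_RULES["high_risk_paths"]):
            return "high"
    return "low"

def required_checks(changed_paths):
    """Every check listed here must pass before the PR may merge."""
    return MERGE_RULES["required_checks"][risk_level(changed_paths)]
```

Because both the documentation of the rules and the enforcement script read the same file, they cannot drift apart.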

Run Qualification Checks Before CI

Running builds on PRs that haven’t even passed review is burning money. Place a risk-policy-gate in front of CI fanout. This alone cuts unnecessary CI costs significantly.

  • Fixed order: policy gate → Review Agent confirmation → CI fanout
  • Unqualified PRs never even enter the test/build stage
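The fixed ordering can be expressed as a tiny state machine. This is an assumed sketch, with stage names of my own choosing, not Carson's workflow code:

```python
# Illustrative preflight ordering: CI fanout runs only after the policy gate
# and the Review Agent have both passed.
PIPELINE_ORDER = ["policy_gate", "review_agent", "ci_fanout"]

def next_stage(completed):
    """Return the next stage a PR may enter, given the stages it has passed."""
    for stage in PIPELINE_ORDER:
        if stage not in completed:
            return stage
    return "merge"

def may_run_ci(completed):
    # Unqualified PRs (gate or review missing) never reach the build stage.
    return {"policy_gate", "review_agent"} <= set(completed)
```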

Never Trust a “Pass” from a Stale Commit

This is what Carson emphasized most. If a “pass” from an old commit lingers, the latest code merges without verification. Re-run reviews on every push, and block the gate if they don’t match.

  • A Review Check Run is valid only when it matches the headSha
  • Force a rerun on every synchronize event
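The staleness check itself is small. A hedged sketch, assuming check-run data shaped roughly like GitHub's (field names are simplified assumptions):

```python
def review_is_valid(check_run, head_sha):
    """A Review Check Run counts only if it passed against the current head."""
    return (check_run["conclusion"] == "success"
            and check_run["head_sha"] == head_sha)

def on_synchronize(pr, rerun_review):
    # Every push fires a `synchronize` event and invalidates prior passes.
    last = pr.get("review_check")
    if last is None or not review_is_valid(last, pr["head_sha"]):
        rerun_review(pr["head_sha"])
```

A green check pinned to an old SHA is treated exactly like no check at all.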

Issue Rerun Requests from Exactly One Source

When multiple workflows request reruns, you get duplicate comments and race conditions. It seems trivial, but left unfixed it destabilizes the entire pipeline.

  • Prevent duplicates with a Marker + sha:headSha pattern
  • Skip the request if the SHA was already submitted
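One way to implement that dedupe, sketched under the assumption that rerun requests are posted as PR comments carrying a marker plus the head SHA (the marker format below is my invention):

```python
# Dedupe sketch: a single workflow owns rerun requests, and a marker comment
# embedding the head SHA prevents duplicates for the same commit.
MARKER = "<!-- rerun-request sha:{sha} -->"

def request_rerun(existing_comments, head_sha, post_comment):
    """Post a rerun request unless one already exists for this SHA."""
    marker = MARKER.format(sha=head_sha)
    if any(marker in c for c in existing_comments):
        return False  # this SHA was already submitted; skip
    post_comment(f"{marker}\n@review-bot please re-review")
    return True
```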

Let Agents Handle the Fixes Too

When the Review Agent finds a problem, the Coding Agent patches it and pushes to the same branch. The sharpest insight from Carson’s post: pin the model version. Otherwise, you get different results every time, and reproducibility is gone.

  • Codex Action fixes → push → rerun trigger
  • Pinned model versions ensure reproducibility
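The review-fix-rerun loop might look like the following. The agent interfaces and the model string are hypothetical placeholders; the real point is that the model version is pinned rather than floating on "latest":

```python
# Fix-loop sketch. Agent callables and the model name are assumptions; pinning
# the model version is what makes repeated runs reproducible.
PINNED_MODEL = "codex-2025-01-15"  # hypothetical pinned version, never "latest"

def fix_until_clean(review, apply_patch, push, trigger_rerun, max_rounds=3):
    for _ in range(max_rounds):
        findings = review(model=PINNED_MODEL)
        if not findings:
            return True           # clean review: done
        apply_patch(findings)     # Coding Agent patches on the same branch
        push()
        trigger_rerun()
    return False                  # too many rounds: escalate to a human
```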

Only Auto-Close Bot-to-Bot Conversations

Never touch threads where a human participated. Without this distinction, reviewer comments get buried.

  • Auto-resolve only after a clean current-head rerun
  • Threads with human comments stay open, always
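Both conditions fit in one guard function. A sketch with illustrative field names, not the actual workflow's data model:

```python
def may_auto_resolve(thread, head_sha):
    """Auto-resolve a review thread only if it is purely bot-to-bot and the
    current head passed a clean rerun. Field names are illustrative."""
    has_human = any(not c["author_is_bot"] for c in thread["comments"])
    if has_human:
        return False  # threads with human comments stay open, always
    return thread["last_clean_rerun_sha"] == head_sha
```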

Leave Visible, Verifiable Evidence

If the UI changed, don’t just take a screenshot. Require CI-verifiable evidence. Turn production incidents into test cases so the same failure never repeats.

  • Regression → harness gap issue → add test case → SLA tracking
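That flow can be tracked with a simple record per incident. The fields and the seven-day SLA below are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Sketch of the regression loop: each production incident becomes a harness-gap
# issue with an SLA deadline for landing the covering test case.
def file_harness_gap(incident, now, sla_days=7):
    return {
        "incident_id": incident["id"],
        "status": "needs_test_case",
        "sla_deadline": now + timedelta(days=sla_days),
    }

def sla_breached(issue, now):
    """True if the covering test still hasn't landed past the deadline."""
    return issue["status"] != "test_added" and now > issue["sla_deadline"]
```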

Carson’s tool choices

For reference, here’s what Carson selected: Greptile as the code review agent, Codex Action for remediation, with three workflow files handling the heavy lifting: greptile-rerun.yml for canonical reruns, greptile-auto-resolve-threads.yml for stale-thread cleanup, and risk-policy-gate.yml for preflight policy.

Beyond correctness: visual verification

Everything above catches whether code is right or wrong. But in practice, you also need to verify how the output looks.

Two approaches stand out.

Nico Bailon’s visual-explainer renders terminal diffs as HTML pages instead of ASCII, making change sets immediately readable at a glance.

Chris Tate’s agent-browser takes a different direction. It compares actual browser screens pixel by pixel to catch CSS and layout breakage. Combined with bisect, it can pinpoint exactly which commit caused the regression.

I’ve been thinking about this while building codexBridge. Session logs alone aren’t enough to track which agent wrote which code; you need a search structure that makes the evidence easy to retrieve.

The bottom line

The answer to “who verifies code written by agents” is not humans. It’s a structure where machines judge the evidence that machines produced.
