
570,000 Lines of LLM Code Compiled Fine. It Was 20,171x Slower Than SQLite.

Someone benchmarked an LLM-written Rust reimplementation of SQLite. The gap between code that looks right and code that is right turned out to be more than four orders of magnitude.

A Rust reimplementation of SQLite, written entirely by an LLM, recently got benchmarked. It compiled. Tests passed. The code was clean, well-structured, and idiomatic Rust. On a basic primary key lookup, it was 20,171 times slower than SQLite.

That number stopped me. Not because LLM-generated code being slow is surprising, but because of where the slowness came from. The code wasn’t wrong in any way a compiler or test suite would catch. The B-tree was correctly implemented. The query planner existed. The storage engine worked. Every piece was individually defensible. The system as a whole was almost unusable.

I spent time reading through the benchmark analysis and the source code. The patterns I found keep showing up in LLM-generated projects, and I think they point to something fundamental about how these models write code.

The B-tree was there. The query planner ignored it.

In SQLite, a PRIMARY KEY lookup takes the B-tree path and finishes in O(log n) time. Four lines in where.c check for iPKey and route the query directly to the tree. This is one of those micro-optimizations that only makes sense if you understand how the entire system fits together.

The LLM-generated version had a B-tree implementation too. It worked correctly in isolation. The problem was that the query planner never called it for primary key lookups. The is_rowid_ref() function only recognized three literal strings: “rowid”, “_rowid_”, and “oid”. If you declared a column as id INTEGER PRIMARY KEY, the planner didn’t recognize it as a rowid alias. Every query hit a full table scan instead.
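A minimal sketch of that failure mode, with the check reduced to its essence (the function signature and column names here are illustrative, not the project's actual code):

```rust
// Illustrative reconstruction of the planner check described above.
// The real fix needs schema awareness: a column declared as
// `id INTEGER PRIMARY KEY` is a rowid alias even though its name isn't "rowid".
fn is_rowid_ref(column_name: &str) -> bool {
    matches!(
        column_name.to_ascii_lowercase().as_str(),
        "rowid" | "_rowid_" | "oid"
    )
}

fn main() {
    assert!(is_rowid_ref("rowid")); // routed to the B-tree
    assert!(!is_rowid_ref("id"));   // declared PRIMARY KEY, but falls to a full scan
}
```

A string match can never be right here; only the schema knows which column is the rowid alias, which is exactly what SQLite's iPKey check encodes.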

The math on this is brutal. For 100 rows queried 100 times, the B-tree path takes roughly 700 comparison steps. The full scan path takes 10,000. But the real damage comes from algorithmic complexity: O(log n) per lookup becomes O(n), and across the full benchmark suite, that compounds into the 20,171x gap.
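The arithmetic, spelled out with the row and query counts from the example above:

```rust
fn main() {
    let rows: u64 = 100;
    let queries: u64 = 100;

    // B-tree lookup: ~ceil(log2(n)) comparisons per query.
    let per_lookup = (rows as f64).log2().ceil() as u64; // 7 for n = 100
    let btree_steps = queries * per_lookup;

    // Full scan: every row examined on every query.
    let scan_steps = queries * rows;

    assert_eq!(btree_steps, 700);
    assert_eq!(scan_steps, 10_000);
}
```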

This is the kind of bug that no unit test catches unless you specifically write a benchmark. The B-tree works. The scan works. The planner picks the wrong one. Everything passes.

Safe defaults compound like interest

Here’s what made this case more interesting than a single routing bug. Even after accounting for the query planner issue, the reimplementation was still roughly 2,900 times slower. The remaining gap came from a stack of individually reasonable decisions.

Every query execution cloned the full AST and recompiled it to bytecode. SQLite reuses prepared statement handles. Both approaches are valid, but cloning an AST on every execution is expensive at scale.
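The reuse pattern is what SQLite exposes as prepared statements. A sketch, assuming a hypothetical Bytecode type and compile step (both illustrative, not the project's API):

```rust
use std::collections::HashMap;

struct Bytecode(Vec<u8>);

// Stand-in for parse + plan + codegen — the expensive part.
fn compile(sql: &str) -> Bytecode {
    Bytecode(sql.as_bytes().to_vec())
}

struct StatementCache {
    compiled: HashMap<String, Bytecode>,
}

impl StatementCache {
    fn new() -> Self {
        Self { compiled: HashMap::new() }
    }

    // Compile on first use, then hand back the cached program —
    // no per-execution AST clone, no recompilation.
    fn prepare(&mut self, sql: &str) -> &Bytecode {
        self.compiled
            .entry(sql.to_string())
            .or_insert_with(|| compile(sql))
    }
}

fn main() {
    let mut cache = StatementCache::new();
    for _ in 0..100 {
        let _stmt = cache.prepare("SELECT * FROM t WHERE id = ?");
    }
    assert_eq!(cache.compiled.len(), 1); // compiled once, executed 100 times
}
```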

Every page read allocated a fresh 4KB buffer on the heap. SQLite’s page cache returns a direct pointer to already-loaded memory. The LLM version chose the safe, obvious path: allocate, read, return. It works. It’s just orders of magnitude slower when you’re reading thousands of pages per query.
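The two page-read strategies side by side, as a sketch (the PageCache type and 4 KB page size mirror the description above, not the project's code):

```rust
use std::collections::HashMap;

const PAGE_SIZE: usize = 4096;

struct PageCache {
    pages: HashMap<u32, Vec<u8>>, // already-loaded pages
}

impl PageCache {
    // Safe default: a fresh 4 KB heap allocation on every read.
    fn read_copy(&self, page_no: u32) -> Vec<u8> {
        self.pages
            .get(&page_no)
            .cloned()
            .unwrap_or_else(|| vec![0u8; PAGE_SIZE])
    }

    // Cache path: borrow the bytes that are already in memory.
    fn read_borrow(&self, page_no: u32) -> Option<&[u8]> {
        self.pages.get(&page_no).map(Vec::as_slice)
    }
}

fn main() {
    let mut pages = HashMap::new();
    pages.insert(1u32, vec![0xABu8; PAGE_SIZE]);
    let cache = PageCache { pages };

    let copied = cache.read_copy(1);     // allocates on every call
    let borrowed = cache.read_borrow(1); // zero allocations
    assert_eq!(Some(copied.as_slice()), borrowed);
}
```

Rust's borrow checker is precisely what makes the second version safe without the defensive copy; the LLM paid for safety it already had.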

Every commit rebuilt the entire schema from scratch. SQLite compares a single integer cookie value. If the cookie hasn’t changed, the schema is still valid. The reimplementation didn’t have this concept, so it did the full work every time.
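The cookie trick is a one-integer comparison. A sketch (field and method names are illustrative; in SQLite the cookie lives in the database file header):

```rust
// Sketch of SQLite's schema-cookie idea, not the reimplementation's code.
struct Connection {
    schema_cookie: u32,
    schema_loaded: bool,
}

impl Connection {
    // Returns true only when a full schema rebuild was actually needed.
    fn ensure_schema(&mut self, on_disk_cookie: u32) -> bool {
        // One integer comparison replaces rebuilding the schema from scratch.
        if self.schema_loaded && on_disk_cookie == self.schema_cookie {
            return false; // cached schema still valid
        }
        // ...reparse the schema table here...
        self.schema_cookie = on_disk_cookie;
        self.schema_loaded = true;
        true
    }
}

fn main() {
    let mut conn = Connection { schema_cookie: 0, schema_loaded: false };
    assert!(conn.ensure_schema(7));  // first use: full rebuild
    assert!(!conn.ensure_schema(7)); // cookie unchanged: skip the work
    assert!(conn.ensure_schema(8));  // schema changed: rebuild once
}
```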

Every statement triggered a sync_all() call to flush all file metadata to disk. SQLite uses fdatasync(), which only flushes the file data and skips the metadata sync. The difference matters enormously on write-heavy workloads.
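In Rust's standard library the two calls map directly onto fsync and fdatasync; the file name below is just for the demo:

```rust
use std::fs::OpenOptions;
use std::io::Write;

fn main() -> std::io::Result<()> {
    let mut f = OpenOptions::new()
        .create(true)
        .write(true)
        .open("demo.db-journal")?; // throwaway file for the demo

    f.write_all(b"commit record")?;
    // sync_all ~ fsync: flushes data *and* file metadata. Strongest, slowest.
    f.sync_all()?;

    f.write_all(b"another record")?;
    // sync_data ~ fdatasync: flushes the bytes, skips metadata where the
    // platform allows. Enough for the data itself to survive a crash.
    f.sync_data()?;

    std::fs::remove_file("demo.db-journal")
}
```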

I want to call this the compound effect of defensive defaults. Each choice in isolation has a reasonable justification. Cloning the AST avoids ownership complexity in Rust. Allocating fresh buffers prevents use-after-free bugs. Rebuilding the schema avoids stale cache issues. Calling sync_all() provides the strongest durability guarantee.

But performance costs multiply, not add. When four 10x penalties stack, you don’t get 40x slower. You get 10,000x slower. An LLM doesn’t reason about this compounding because it generates each function in relative isolation. It optimizes locally and pays globally.
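The compounding is trivial to check: stacked slowdowns multiply.

```rust
fn main() {
    let penalties = [10.0_f64, 10.0, 10.0, 10.0]; // four independent 10x costs
    let added: f64 = penalties.iter().sum();          // the intuition: 40x
    let multiplied: f64 = penalties.iter().product(); // the reality: 10,000x
    assert_eq!(added, 40.0);
    assert_eq!(multiplied, 10_000.0);
}
```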

82,000 lines to replace a cron one-liner

The same developer’s other LLM-generated project showed the same pattern in a different way. The problem: build artifacts in Rust’s target/ directory eat disk space over time. The LLM’s solution: an 82,000-line Rust daemon with seven dashboards and a Bayesian scoring engine to decide which artifacts to clean up.

The existing solution is find ./target -type f -atime +30 -delete, a single line in a cron job. Zero dependencies. Or cargo-sweep, an established community tool that already exists and handles edge cases the daemon doesn’t.

The LLM-generated project pulled in 192 dependencies. For reference, ripgrep, one of the most sophisticated search tools in the Rust ecosystem, uses 61.

This is a pattern I keep seeing: LLMs build what you ask for, not what you need. If you prompt “build a system that intelligently manages Rust build artifacts with monitoring and scoring,” you get exactly that. The model has no mechanism to step back and ask whether the problem requires a system at all. It doesn’t know that target/ directory size is a perennial complaint in the Rust community with well-known solutions. It doesn’t consider the maintenance cost of 192 dependencies versus zero.

The research points the same direction

I was curious whether these two projects were outliers, so I looked at the broader research. They’re not.

METR ran a randomized controlled trial with 16 experienced open-source developers. The group using AI tools completed tasks 19% slower than the control group. The part that stuck with me: after the experiment ended, the AI group believed they had been 20% faster. The subjective experience of productivity was inverted from the measured reality.

GitClear analyzed 210 million lines of code and found that copy-pasted code overtook refactored code for the first time. The trend correlates directly with AI coding tool adoption. Code is being added faster than it’s being improved.

Google’s DORA 2024 report found that a 25% increase in AI adoption correlated with a 7.2% drop in deployment stability. More AI-generated code going into production, more incidents coming out.

The Mercury benchmark from NeurIPS 2024 added efficiency metrics to the standard coding benchmarks. When you measure not just “does it produce correct output” but “does it produce correct output without wasting resources,” pass rates dropped below 50%.

None of this means LLMs are useless for coding. I use them constantly. But it does mean that “compiles and passes tests” is a dangerously low bar. The gap between plausible code and correct code is where the real engineering happens.

What this actually demands from developers

The core problem isn’t that LLMs write bad code. They write code that is locally coherent and globally incoherent. Each function makes sense. The system doesn’t. This is the exact failure mode that traditional testing misses, because tests verify local behavior.

What’s needed is evaluation that targets the gaps. Benchmarks, not just tests. Performance budgets in CI, not just correctness checks. Architectural review that asks “why does this module exist” before checking whether it works. Dependency audits that compare the solution’s complexity against the problem’s complexity.

The question isn’t “does this code look right?” It’s “how do we prove it’s right?” And proving it requires the kind of systems-level thinking that LLMs currently lack.

The gap between what you asked for and what production demands is where engineering judgment lives. Without measurement, code generation is just token generation.
