Making LLMs Write Code to Read 10M Tokens - How RLM Works
Bigger context windows don't make AI smarter. RLM flips the script by letting LLMs write code to selectively read massive documents instead of ingesting them whole.
“Does a bigger context window make AI smarter?”
The answer is no. In fact, past a certain point, the longer the input, the worse the model tends to perform.
Context Rot - Why We Need RLM
LLMs predict the next token by attending over every token in the input. The problem is that as the input grows longer, relevant information gets diluted among irrelevant tokens, making it increasingly difficult for the model to focus on what matters.
Even models that advertise 128K or 1M context windows are most accurate at around 10K tokens. Beyond 100K, performance drops sharply. This phenomenon is called Context Rot.
Think of it like reading a 500-page manual to answer a single question. You don’t need all 500 pages - you need the right three paragraphs. But an LLM with a massive context window tries to digest the entire book at once, and the signal gets buried under noise.
The Core Idea - “Don’t Read Everything, Extract What You Need”
RLM (Recursive Language Model) takes a fundamentally different approach. Instead of feeding 10M tokens into an LLM all at once, it stores the text in Python variables and lets the LLM write code to selectively read only the parts it needs.
In simple terms, it treats the LLM as a CPU and the massive text corpus as a hard drive. The model processes roughly 50K relevant tokens at a time while accurately extracting answers from documents exceeding 10M tokens.
This is a paradigm shift from “give the model more context” to “let the model decide what context it needs.”
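In miniature, the idea looks something like this. This is a toy sketch, not RLM's actual API - the document is just a Python string, and "reading" means slicing or filtering it with code:

```python
# Toy illustration: the corpus lives in a variable, not in the prompt.
context = "INTRO...\n" + ("filler line\n" * 1_000) + "Total revenue: 42M\n"

# Traditional approach: put all of `context` into one prompt -> context rot.
# RLM-style approach: the model writes code like this to read selectively.
preview = context[:500]                      # peek at the beginning
relevant = [line for line in context.splitlines()
            if "revenue" in line]            # deterministic narrowing
```

Only `preview` and `relevant` - a few hundred tokens - ever need to reach the model, no matter how large `context` grows.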
Three Core Components
RLM Orchestrator
The controller that manages message history, controls the iteration loop, and determines when a final answer has been reached. It decides whether the model needs another pass through the data or has gathered enough information to respond.
LMHandler
A socket server that relays LLM API requests. Crucially, this handler enables LLM calls even during code execution - meaning the model can ask itself questions while processing data.
Environment / REPL
A Python sandbox where the massive text is stored in a context variable. The key function is llm_query(), which allows the LLM to recursively call itself during execution. This is where the “Recursive” in RLM gets its name.
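A minimal sketch of how such a sandbox could be wired up, assuming a stubbed `llm_query` in place of the real call relayed through the LMHandler:

```python
def llm_query(prompt: str) -> str:
    """Stand-in for a recursive LLM call. A real implementation would
    relay `prompt` to the model via the handler and return the completion."""
    return f"[answer to: {prompt[:40]}...]"

# The sandbox namespace exposes the corpus plus llm_query to model-written code.
sandbox = {"context": "some very long document...", "llm_query": llm_query}

# Code emitted by the model executes inside that namespace:
model_code = "summary = llm_query('Summarize: ' + context[:200])"
exec(model_code, sandbox)
```

Because `llm_query` is just another name in the namespace, model-written code can call the model itself mid-execution - the recursion the name RLM refers to.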
The Execution Loop - Explore, Decompose, Aggregate
RLM follows a structured loop to work through massive documents:
Explore: The model starts by examining the data structure - running something like print(context[:500]) to understand what it’s working with.
Decompose: It splits the text into chunks and uses llm_query() to ask sub-questions about each chunk. Instead of one monolithic query over millions of tokens, it runs dozens of focused queries over manageable pieces.
Aggregate: Sub-answers are combined - either through Python logic (counting, filtering, sorting) or by asking the LLM to synthesize results from multiple chunks.
Terminate: When the model has gathered enough information, it calls FINAL(answer) to return the final response.
This loop can repeat as many times as needed, with each iteration building on what previous iterations discovered.
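The loop above can be sketched in miniature. Here `llm_query` is a stub standing in for a real model call, and the data and chunking are deliberately tiny:

```python
def llm_query(prompt: str) -> str:
    # Stand-in: a real implementation would call the model API.
    return "yes" if "fatal" in prompt else "no"

context = ("ok\n" * 50) + "fatal error in worker 3\n" + ("ok\n" * 50)

# Explore: peek at the data before committing to a strategy.
preview = context[:20]

# Decompose: split into line-based chunks and run one focused
# sub-question per chunk instead of one monolithic query.
lines = context.splitlines()
chunks = ["\n".join(lines[i:i + 20]) for i in range(0, len(lines), 20)]
sub_answers = [llm_query("Does this chunk report a failure?\n" + c)
               for c in chunks]

# Aggregate: combine sub-answers with plain Python logic.
failures = sum(a == "yes" for a in sub_answers)

# Terminate: package the result (RLM would call FINAL(answer) here).
final_answer = f"{failures} chunk(s) report a failure"
```

Each stage maps directly onto the loop: a cheap peek, many small focused queries, deterministic aggregation, and an explicit termination step.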
The Ralph Wiggum Connection - “It’s OK to Fail, Keep Iterating”
The recently trending Claude Code plugin Ralph Wiggum shares a philosophy with RLM while solving a different problem.
How Ralph Wiggum Works
When Claude finishes a task and attempts to exit, a Stop Hook intercepts the termination and re-injects the original prompt. On each iteration, Claude can see files modified in previous runs and the Git history, enabling it to progressively chip away at problems.
What They Share
Both approaches tackle problems that can’t be solved in a single LLM call by using iterative loops. Both treat failure as data - each iteration references previous results to improve.
Where They Differ
RLM specializes in ultra-large-scale text processing, with recursive LLM calls during code execution as its core mechanism. Ralph Wiggum focuses on autonomous development task execution, operating by intercepting session termination to drive continuous improvement.
Practical Tips for Implementation
Batch processing: Don’t call the LLM 1,000 times for 1,000 lines. Process 50 lines at a time across 20 calls instead.
Pre-filtering: Before calling llm_query(), use regex to narrow down candidates. Let deterministic code handle what it can before spending tokens on LLM reasoning.
Limit recursion depth: Depth 1 is usually sufficient. Going deeper compounds errors rather than improving accuracy.
History management: Use rolling windows or summary-based approaches to prevent context rot from accumulating across iterations.
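The first two tips can be combined in a short sketch - regex pre-filtering followed by batched model calls, with `llm_query` again stubbed in for the real API:

```python
import re

def llm_query(prompt: str) -> str:
    return "classified"  # stand-in for a real LLM response

lines = [f"2024-01-{d:02d} INFO ok" for d in range(1, 31)]
lines += ["2024-02-01 ERROR disk full", "2024-02-02 ERROR oom"]

# Pre-filter: let cheap deterministic code discard obvious non-matches
# before spending any tokens on LLM reasoning.
candidates = [ln for ln in lines if re.search(r"\bERROR\b", ln)]

# Batch: 50 lines per call instead of one call per line.
BATCH = 50
batches = [candidates[i:i + BATCH] for i in range(0, len(candidates), BATCH)]
results = [llm_query("Classify these errors:\n" + "\n".join(b))
           for b in batches]
```

Here 32 log lines collapse to 2 candidates and a single model call - the regex does the narrowing, and the LLM only sees what survives it.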
What This Means
RLM represents a shift from brute-forcing larger context windows to designing smarter ways for models to read data. Instead of cramming everything into a single prompt, the model writes code, executes it, reads selectively, and calls itself recursively when needed.
This agent-style reasoning - where LLMs write and execute code to autonomously find the information they need - is becoming increasingly central to how we build AI systems. The future isn’t bigger context windows. It’s smarter access patterns.