Making LLMs Write Code to Read 10M Tokens - How RLM Works
Bigger context windows don't make AI smarter. RLM flips the script by letting LLMs write code to selectively read massive documents instead of ingesting them whole.
In practice, the longer the input, the worse a model performs on tasks requiring precision. RLM is an attempt to work around this by letting the model choose what to read rather than trying to digest everything at once.
Why Large Context Windows Fail
LLMs predict the next token by attending over every token in the input. As the input grows longer, relevant information gets diluted by noise, and accurate retrieval becomes harder.
Even models that advertise 128K or 1M context windows perform most accurately around 10K tokens. Beyond 100K, performance drops sharply. This phenomenon is known as context rot.
The analogy that makes it concrete: you are reading a 500-page manual to answer a single question. You do not need all 500 pages. You need the right three paragraphs. An LLM with a massive context window tries to digest the entire book at once, and the signal gets lost. Larger windows delay the problem; they do not solve it.
The Core Idea: Extract What You Need
RLM (Recursive Language Model) takes a different approach. Instead of feeding 10M tokens into an LLM all at once, it stores the text in Python variables and lets the LLM write code to selectively read only the parts it needs.
The mental model: treat the LLM as a CPU and the massive text corpus as a hard drive. The model processes roughly 50K relevant tokens at a time while accurately extracting answers from documents exceeding 10M tokens. The shift is from “give the model more context” to “let the model decide what context it needs.”
Whether this scales reliably across diverse document structures is an open question. The approach works well on structured or semi-structured text where the model can write useful filtering code. On unstructured prose, the filtering step can itself become a bottleneck.
Three Core Components
The RLM Orchestrator is the controller that manages message history, controls the iteration loop, and determines when a final answer has been reached. It decides whether the model needs another pass through the data or has gathered enough information to respond.
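A minimal sketch of what that controller loop might look like, assuming a hypothetical `model` callable and `sandbox` object (these names are illustrative, not the actual RLM API):

```python
# Sketch of an RLM-style orchestrator loop (names are illustrative).
# The model either emits code to run in the sandbox or a FINAL(...)
# call that signals the answer has been reached.
def orchestrate(task, sandbox, model, max_iters=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_iters):
        reply = model(history)                 # ask the LLM for its next action
        if reply.startswith("FINAL("):
            return reply[len("FINAL("):-1]     # answer found: stop iterating
        output = sandbox.run(reply)            # otherwise execute the emitted code
        history.append({"role": "assistant", "content": reply})
        history.append({"role": "user", "content": output})
    return None                                # gave up without a final answer
```

The key design point is that the loop, not the model, owns termination: the model can only end the conversation by explicitly producing a final answer.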
The LMHandler is a socket server that relays LLM API requests. It enables LLM calls even during code execution, meaning the model can ask itself questions while processing data.
The Environment is a Python sandbox where the massive text is stored in a context variable. The key function is llm_query(), which allows the LLM to recursively call itself during execution. This is where the “Recursive” in RLM gets its name.
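A rough sketch of how such an environment could be wired up. Here `call_llm` is a stand-in for a real API client, and `exec()` is shown only for brevity; an actual sandbox would isolate execution:

```python
# Illustrative sketch: the corpus lives in a Python variable, and
# llm_query() lets generated code call the LLM recursively.
# `call_llm` is a hypothetical stand-in for a real API client.
def make_environment(corpus, call_llm):
    def llm_query(prompt, text):
        # Recursive call: a fresh LLM invocation sees only this slice of text.
        return call_llm(f"{prompt}\n\n{text}")

    # Names made available to the code the orchestrator executes.
    return {"context": corpus, "llm_query": llm_query}

def run_in_sandbox(code, env):
    # exec() shown for brevity; a real sandbox would restrict builtins,
    # filesystem access, and network access.
    exec(code, env)
```

Because `context` is just a variable, the generated code can slice, search, and chunk it with ordinary Python before any tokens are spent on an LLM call.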
The Execution Loop: Explore, Decompose, Aggregate
RLM follows a structured loop to work through massive documents.
The model starts by examining the data structure, running something like print(context[:500]) to understand what it is working with. It then splits the text into chunks and uses llm_query() to ask sub-questions about each chunk. Instead of one monolithic query over millions of tokens, it runs dozens of focused queries over manageable pieces. Sub-answers are then combined through Python logic (counting, filtering, sorting) or by asking the LLM to synthesize results from multiple chunks. When the model has gathered enough information, it calls FINAL(answer) to return the final response.
This loop can repeat as many times as needed, with each iteration building on what previous iterations discovered.
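The steps above translate into the kind of program the model might write inside the sandbox. In this sketch, `llm_query` and the error-counting question are illustrative stand-ins, not part of any fixed API:

```python
# Explore, decompose, aggregate: split the corpus into chunks, ask a
# focused sub-question about each, then combine the answers with plain
# Python instead of a single monolithic LLM call.
def count_errors(context, llm_query, chunk_size=50_000):
    preview = context[:500]                      # explore: peek at the data first
    chunks = [context[i:i + chunk_size]
              for i in range(0, len(context), chunk_size)]
    sub_answers = [
        llm_query("How many lines containing ERROR are in this excerpt? "
                  "Reply with a number only.", chunk)
        for chunk in chunks
    ]
    return sum(int(a) for a in sub_answers)      # aggregate in Python, not the LLM
```

Note that aggregation here is deterministic code (`sum`), which is both cheaper and more reliable than asking the LLM to add numbers across chunks.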
Comparison with the Ralph Wiggum Pattern
The recently trending Claude Code plugin Ralph Wiggum shares a philosophy with RLM while solving a different problem.
When Claude finishes a task and attempts to exit, a Stop Hook intercepts the termination and re-injects the original prompt. On each iteration, Claude can see files modified in previous runs and the Git history, enabling it to progressively chip away at problems.
Both approaches tackle problems that cannot be solved in a single LLM call by using iterative loops. Both treat each iteration as an opportunity to improve on the previous result. The difference is scope: RLM specializes in ultra-large-scale text processing, with recursive LLM calls during code execution as its core mechanism. Ralph Wiggum focuses on autonomous development task execution, operating by intercepting session termination to drive continuous improvement.
Practical Notes for Implementation
Batch processing matters. Do not call the LLM 1,000 times for 1,000 lines. Processing 50 lines at a time across 20 calls is substantially cheaper and often more accurate.
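A minimal sketch of that batching pattern, with `llm_query` again standing in for whatever client the environment exposes:

```python
# Group lines before calling the LLM: 1,000 lines become 20 calls of
# 50 lines each instead of 1,000 single-line calls.
def batched_queries(lines, llm_query, batch_size=50):
    results = []
    for i in range(0, len(lines), batch_size):
        batch = "\n".join(lines[i:i + batch_size])
        results.append(llm_query("Summarize these log lines:", batch))
    return results
```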
Pre-filtering with regex before calling llm_query() lets deterministic code handle what it can before spending tokens on LLM reasoning. This is one of the more reliable ways to keep costs down.
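One way to sketch this, assuming a log-style corpus and an illustrative error pattern:

```python
import re

# Deterministic pre-filter: keep only lines a cheap regex matches, then
# spend LLM tokens on the survivors only. The pattern is illustrative.
def prefilter_then_query(context, llm_query, pattern=r"ERROR|FATAL"):
    hits = [line for line in context.splitlines() if re.search(pattern, line)]
    if not hits:
        return "no matching lines"   # nothing worth an LLM call
    return llm_query("Explain the likely root cause of these errors:",
                     "\n".join(hits))
```

The early return is the point: when the regex finds nothing, no LLM call happens at all.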
Recursion depth should stay shallow. Depth 1 is usually sufficient. Going deeper compounds errors rather than improving accuracy, and the cost grows quickly.
History management across iterations needs active attention. Rolling windows or summary-based approaches prevent context rot from accumulating, which is somewhat ironic given that context rot is exactly what RLM is designed to avoid.
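One simple form of that rolling window, purely illustrative (a real implementation might summarize the dropped turns instead of discarding them):

```python
# Keep the first message (the original task) plus the most recent turns,
# so history stays bounded no matter how many iterations run.
def trim_history(history, keep_recent=6):
    if len(history) <= keep_recent + 1:
        return history
    return [history[0]] + history[-keep_recent:]
```

Pinning the first message matters: if the original task falls out of the window, later iterations lose sight of what they were solving.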
What This Points To
RLM is a shift away from brute-forcing larger context windows toward designing smarter ways for models to read data. Instead of cramming everything into a single prompt, the model writes code, executes it, reads selectively, and calls itself recursively when needed.
This agent-style reasoning, where LLMs write and execute code to find the information they need, is increasingly central to how practical AI systems get built. Smarter access patterns, not bigger windows, are what actually move the needle.