Paste Your Prompt Twice and Watch Accuracy Change

Google Research validated it across 7 models and 7 benchmarks. No training, no prompt engineering. Just copy-paste. I tested it and here's what actually happened.

My first reaction was that someone was pranking me. Paste your prompt twice and the model gets smarter? It sounds like the kind of advice you’d ignore on Reddit. But then I looked at the paper: Google Research, seven models, seven benchmarks, reproducible results. One test jumped from 21% to 97% accuracy. I stopped laughing.

The paper is called Prompt Repetition Improves Non-Reasoning LLMs (arXiv 2512.14982), authored by Yaniv Leviathan, Matan Kalman, and Yossi Matias. The core claim is simple: if you repeat your full prompt twice, you get measurably better answers, with no training, no fine-tuning, and no clever prompt engineering tricks.

Why the Model Reads Your Context Blind

To understand why this works, you need to think about how a transformer actually processes your input. Tokens are processed left-to-right. When the model is reading through your pasted document or your long context, it has no idea what question is coming at the end. It’s encoding all that text without any awareness of what it’s supposed to care about.

This is the structural problem. Causal attention means earlier tokens cannot attend to later ones. So by the time your model reaches the word “summarize” or “which of these” at the end of your prompt, all the heavy lifting of encoding the context has already happened, without any signal from the question itself.

With short prompts, this barely matters. But with long context, a long pasted document, or a complex few-shot setup, the mismatch becomes real. The model reads hundreds or thousands of tokens of context in a kind of question-unaware state, then gets asked to do something specific with it.
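The asymmetry is easy to see in a toy causal mask. A minimal NumPy sketch (the token labels are illustrative, not real tokenizer output):

```python
# Toy causal attention mask: row i marks which positions token i may attend to.
# With the question at the end, context tokens (earlier rows) can never see it.
import numpy as np

tokens = ["ctx1", "ctx2", "ctx3", "question"]
n = len(tokens)

# Lower-triangular mask: mask[i, j] is True iff token i can attend to token j.
mask = np.tril(np.ones((n, n), dtype=bool))

q = tokens.index("question")
# Every context token is encoded with no access to the question token:
context_sees_question = [bool(mask[i, q]) for i in range(q)]
print(context_sees_question)  # [False, False, False]
```

The question token itself can attend to all the context, but by then the context has already been encoded blind, which is exactly the mismatch the paper targets.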

What Repeating the Prompt Actually Does

When you send [context + question][context + question], something structurally different happens. During the second pass, the model reads the context again, but this time it already has the question in its attention window. The question has been encoded. Every token in the context can now attend to the question tokens.

That’s the mechanism. Not magic, not a hallucination fix, just giving the model a second pass where it actually knows what it’s looking for.
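The construction itself is trivial. A minimal sketch (the `repeat_prompt` helper and the separator choice are mine, not prescribed by the paper):

```python
def repeat_prompt(context: str, question: str) -> str:
    """Build the doubled prompt: [context + question][context + question].

    During the second copy, every context token can attend to the
    question tokens that were already encoded in the first copy.
    """
    single = f"{context}\n\n{question}"
    return f"{single}\n\n{single}"

doubled = repeat_prompt("...pasted document...", "Which clauses conflict?")
```

You then send `doubled` as the user message exactly as you would the single prompt; nothing else about the call changes.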

All seven models tested showed improvement: models from the Gemini, GPT, Claude, and DeepSeek families. Output length didn’t increase. Wall-clock latency was essentially unchanged, because transformer prefill runs in parallel on hardware. Doubling the input tokens costs almost nothing in time. The accuracy gains, though, were not marginal. Jumping from 21% to 97% on one benchmark is not a rounding error.

Three Times Is Just Waste

The natural follow-up question is whether three repetitions would be even better. The answer is mostly no. The paper found that two repetitions capture essentially all the benefit. Going to three means tripling your input token cost for negligible additional improvement. The formula they recommend is <QUERY><QUERY>, not <QUERY><QUERY><QUERY>.

The reason two is enough comes back to parallel prefill: doubling tokens on modern inference hardware barely moves the latency needle, so the second copy is nearly free. A third repetition gives you no new structural benefit, because the context already gets its question-aware pass on copy two. It just adds cost.

When This Does Nothing

This technique has a real failure mode, and you should know it before you start pasting everything twice.

Short, simple questions show no measurable difference. If your prompt is “what’s the capital of France,” repeating it accomplishes nothing. The technique only matters when you have long context combined with a complex question. Think: document analysis with pasted content, multi-step reasoning over a long passage, few-shot prompts with many examples. That’s the zone where the structural asymmetry actually hurts, and where repetition fixes it.

More importantly: the paper title says non-reasoning LLMs for a reason. Models running in a reasoning or extended thinking mode, like o1 or Gemini with thinking enabled, already do something functionally similar internally. They revisit context with question-awareness as part of their reasoning chain. For those models, the external repetition adds cost and provides no benefit. Check what mode you’re running before applying this.

The Cost Math

Input tokens double. That sounds alarming until you check pricing. On most APIs, input tokens cost significantly less than output tokens. Doubling input on a long-context task typically means a cost increase well under 50% for the full call, often closer to 20-30% depending on the ratio of context to response.

If you’re running tasks where wrong answers mean retries, that cost math flips entirely. One correct response is cheaper than two attempts at half the accuracy. For high-stakes queries where you’re currently doing multiple calls or verification steps, the net cost might actually go down.
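A quick back-of-envelope check, using placeholder numbers (the prices and token counts below are hypothetical, not any provider's actual rates):

```python
# Hypothetical per-million-token prices; real rates vary by provider and model.
INPUT_PRICE = 1.00   # $ per 1M input tokens (placeholder)
OUTPUT_PRICE = 5.00  # $ per 1M output tokens (placeholder)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Total cost of one API call at the placeholder rates above."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1e6

# A long-context task: 8k tokens of pasted document, 4k tokens of response.
baseline = call_cost(input_tokens=8_000, output_tokens=4_000)
doubled  = call_cost(input_tokens=16_000, output_tokens=4_000)

increase = (doubled - baseline) / baseline
print(f"{increase:.0%}")  # 29%
```

Because output tokens dominate the bill at typical input/output price ratios, doubling the input lands well short of doubling the cost, which is where the 20-30% figure comes from.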

The Part I Couldn’t Believe Until I Tried It

I ran this on a long-context analysis task I use regularly, one where I paste a document and ask the model to identify specific patterns across it. The difference was visible immediately. Not dramatic every time, but consistent. Responses felt more anchored to the specific question rather than generic summaries of the document.

It’s the kind of thing that feels obvious in retrospect. Of course the model does better when it reads the context knowing what to look for. The surprise is that nobody had systematically validated it across this many models before.

The paper is worth reading if you run any kind of long-context workload: Prompt Repetition Improves Non-Reasoning LLMs by Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research.
