Updated Feb 18, 2026

AI Approaches Human Reasoning for the First Time - Poetiq Breaks 50% on ARC-AGI-2

Poetiq's recursive meta-system became the first to surpass 50% on ARC-AGI-2, the benchmark designed to test true general intelligence. Here's how a 6-person team outperformed Google at half the cost.

Poetiq just made history on the ARC-AGI benchmark.

ARC-AGI is the test designed to evaluate whether AI possesses genuine general intelligence. It doesn’t ask models to regurgitate training data. Instead, it presents completely novel pattern problems and requires the system to infer the underlying rules on its own. Humans average around 60% accuracy. Until now, AI systems fell far short of that mark.

Why Poetiq’s Result Matters

  • First to break 50% on ARC-AGI-2 - officially verified by the ARC Prize Foundation at 54% accuracy
  • Half the cost of the previous state of the art - $30.57 per problem versus Gemini 3 Deep Think’s $77.16
  • A 6-person team with 53 years of combined experience from Google DeepMind outperformed the largest AI labs
  • Fully open-sourced approach and prompts available on GitHub

For context, leading AI models scored under 5% on ARC-AGI-2 in early 2025. The jump from under 5% to over 50% within months signals that something fundamental has shifted.

The Architecture - Recursive Reasoning Over Raw Scale

The core innovation is a meta-system that doesn’t train new models. Instead, it orchestrates existing LLMs through iterative loops of reasoning.

The system generates a candidate solution, critiques it, analyzes the feedback, and uses the LLM to refine the answer. Repeat. The prompt is merely the interface - the real intelligence emerges from this iterative refinement process.

This is a deliberate departure from standard chain-of-thought prompting. Rather than asking once and accepting the output, Poetiq’s system treats each answer as a draft to be improved through structured self-critique.
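The loop described above can be sketched in a few lines of Python. This is a minimal illustration of the generate-critique-refine pattern, not Poetiq's actual implementation: `call_llm` is a hypothetical stand-in for any LLM API client, and the prompts are invented for demonstration.

```python
# Hedged sketch of a generate-critique-refine loop.
# call_llm is a hypothetical stand-in for a real LLM API call;
# here it just echoes so the sketch is runnable end to end.

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call (replace with a real API client)."""
    return f"response to: {prompt[:40]}"

def solve(task: str, max_rounds: int = 3) -> str:
    # First draft: ask once, but treat the output as a draft, not a final answer.
    answer = call_llm(f"Solve this task:\n{task}")
    for _ in range(max_rounds):
        # Structured self-critique of the current draft.
        critique = call_llm(
            f"Critique this answer.\nTask: {task}\nAnswer: {answer}"
        )
        # Refinement: feed the critique back in and revise.
        answer = call_llm(
            f"Revise the answer using the critique.\n"
            f"Task: {task}\nAnswer: {answer}\nCritique: {critique}"
        )
    return answer
```

The key design choice is that the prompt is only the interface between rounds; the loop structure, not any single completion, carries the reasoning.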

Self-Auditing - Knowing When to Stop

The most impressive capability is the self-auditing mechanism. The system autonomously determines when it has gathered sufficient information and when to terminate the reasoning process.

This isn’t just an engineering convenience - it’s a core economic mechanism. By averaging fewer than two LLM requests per ARC problem, the system minimizes unnecessary computation while maintaining accuracy. This is how a small team achieved superior results at half the cost of trillion-dollar competitors.
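A self-auditing stop condition can be added to such a loop as sketched below. Again, this is an assumption-laden illustration rather than Poetiq's method: `generate_candidate` and `self_audit` are hypothetical placeholders (in a real system, the audit step would itself be an LLM judgment), and the call budget is invented.

```python
# Hedged sketch: a self-auditing loop that terminates as soon as a
# candidate passes its own audit, keeping the average call count low.
# generate_candidate and self_audit are hypothetical stand-ins.

def generate_candidate(task: str, attempt: int) -> str:
    """Stand-in for an LLM-generated solution attempt."""
    return f"candidate {attempt} for: {task}"

def self_audit(task: str, candidate: str) -> bool:
    """Stand-in for the system judging whether it has enough
    information to stop. Here it accepts the first candidate."""
    return True

def solve_with_budget(task: str, max_calls: int = 4):
    calls = 0
    candidate = None
    for attempt in range(max_calls):
        candidate = generate_candidate(task, attempt)
        calls += 1
        if self_audit(task, candidate):
            break  # terminate early: no unnecessary computation
    return candidate, calls
```

Because the audit usually passes early, the expected number of calls stays well under the hard budget; that early exit, not any per-call discount, is where the cost advantage comes from.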

What This Proves

Following the Tiny Recursive Model (TRM) and RLM, Poetiq’s result is the strongest evidence yet that recursive reasoning architectures represent a viable path toward AGI.

The lesson isn’t about building bigger models or longer context windows. It’s about designing systems that think iteratively - generating, evaluating, and refining in structured loops. When the reasoning process itself becomes the product, raw model scale matters less than architecture design.

The full implementation, prompts, and methodology are available on GitHub.
