# AI Approaches Human Reasoning for the First Time - Poetiq Breaks 50% on ARC-AGI-2

> Author: Tony Lee
> Published: 2026-02-08
> URL: https://tonylee.im/en/blog/poetiq-arc-agi-2-first-to-break-50-percent/
> Reading time: 3 minutes
> Language: en
> Tags: ai, agi, arc-agi, reasoning, recursive-ai, research

## Canonical

https://tonylee.im/en/blog/poetiq-arc-agi-2-first-to-break-50-percent/

## Rollout Alternates

- en: https://tonylee.im/en/blog/poetiq-arc-agi-2-first-to-break-50-percent/
- ko: https://tonylee.im/ko/blog/poetiq-arc-agi-2-first-to-break-50-percent/
- ja: https://tonylee.im/ja/blog/poetiq-arc-agi-2-first-to-break-50-percent/
- zh-CN: https://tonylee.im/zh-CN/blog/poetiq-arc-agi-2-first-to-break-50-percent/
- zh-TW: https://tonylee.im/zh-TW/blog/poetiq-arc-agi-2-first-to-break-50-percent/

## Description

Poetiq's recursive meta-system became the first to surpass 50% on ARC-AGI-2, the benchmark designed to test true general intelligence. Here's how a six-person team outperformed Google at half the cost.

## Summary

"AI Approaches Human Reasoning for the First Time - Poetiq Breaks 50% on ARC-AGI-2" is part of Tony Lee's ongoing coverage of AI agents, developer tools, startup strategy, and AI industry shifts.

## Outline

- Recursive Reasoning Over Raw Scale
- Self-Auditing: Knowing When to Stop
- What the Architecture Suggests

## Content

ARC-AGI is the test designed to evaluate whether AI possesses genuine general intelligence. It does not ask models to regurgitate training data. Instead, it presents completely novel pattern problems and requires the system to infer the underlying rules on its own. Humans average around 60% accuracy; leading AI models scored under 5% on ARC-AGI-2 in early 2025. Poetiq, a six-person team with 53 years of combined experience at Google DeepMind, has now been officially verified by the ARC Prize Foundation at 54% accuracy on ARC-AGI-2. They are the first to cross 50%.
The cost per problem is $30.57, compared to Gemini 3 Deep Think's $77.16 for a lower score. Their approach and prompts are fully open-sourced on [GitHub](https://github.com/poetiq-ai/poetiq-arc-agi-solver).

## Recursive Reasoning Over Raw Scale

The core architecture is a meta-system that does not train new models. Instead, it orchestrates existing LLMs through iterative loops of reasoning. The system generates a candidate solution, critiques it, analyzes the feedback, and uses the LLM to refine the answer, then repeats. The prompt is the interface; the reasoning process is the product.

This is a deliberate departure from standard chain-of-thought prompting, which asks once and accepts the output. Poetiq's system treats each answer as a draft to be improved through structured self-critique.

The jump from sub-5% to 54% in under a year is striking. Whether ARC-AGI-2 actually measures what its designers claim (general intelligence, rather than a specific pattern-matching capability that recursive refinement happens to exploit well) is a fair question. Goodharting of benchmarks is real, and 54% is still below the human average of roughly 60%.

## Self-Auditing: Knowing When to Stop

The self-auditing mechanism is where the architecture gets interesting. The system determines autonomously when it has gathered sufficient information and when to terminate the reasoning process.

This is not just an engineering convenience. By averaging fewer than two LLM requests per ARC problem, the system avoids the runaway compute costs that plague naive "keep trying" loops. The cost efficiency is a direct consequence of the stopping criterion, not a separate optimization. A system that cannot decide when to stop tends either to terminate too early or to burn tokens indefinitely, and Poetiq appears to have found a workable middle ground, at least on this benchmark.
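The generate-critique-refine loop with a self-auditing stop can be sketched as follows. This is a minimal illustration under stated assumptions, not Poetiq's actual implementation: `generate`, `critique`, and `refine` are hypothetical stand-ins for LLM calls, and the toy task below simply drives a number toward a target.

```python
def refine_loop(generate, critique, refine, max_calls=5, good_enough=0.9):
    """Generate a candidate, then loop: critique it, and either stop
    (the self-audit judges the answer good enough) or refine and retry."""
    candidate = generate()
    for call in range(1, max_calls + 1):
        score, feedback = critique(candidate)
        if score >= good_enough:      # self-auditing stopping criterion
            return candidate, call    # report how many critiques were spent
        candidate = refine(candidate, feedback)
    return candidate, max_calls       # budget exhausted: return best draft

# Toy task: move an integer toward a target; "critique" scores closeness
# and returns the remaining error as feedback for the next refinement.
target = 10
best, calls = refine_loop(
    generate=lambda: 0,
    critique=lambda x: (1 - abs(target - x) / target, target - x),
    refine=lambda x, err: x + max(err // 2, 1),
)
```

In Poetiq's system, the critique signal would itself come from an LLM auditing its own draft; the reported average of fewer than two requests per problem suggests the stopping criterion fires early on most tasks.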
## What the Architecture Suggests

Following the Tiny Recursive Model (TRM) and RLM, Poetiq's result adds evidence that recursive reasoning architectures are a viable path worth taking seriously. The lesson is not about bigger models or longer context windows: systems designed to generate, evaluate, and refine in structured loops can outperform brute-force scale at a fraction of the cost.

How well this transfers to tasks outside ARC-AGI-2's grid-pattern domain is the open question. The methodology is available on [GitHub](https://github.com/poetiq-ai/poetiq-arc-agi-solver) for anyone who wants to test that generalization directly.

## Related URLs

- Author: https://tonylee.im/en/author/
- Publication: https://tonylee.im/en/blog/about/
- Related article: https://tonylee.im/en/blog/medvi-two-person-430m-ai-compressed-funnel/
- Related article: https://tonylee.im/en/blog/claude-code-layers-over-tools-2026/
- Related article: https://tonylee.im/en/blog/codex-inside-claude-code-openai-plugin-strategy/

## Citation

- Author: Tony Lee
- Site: tonylee.im
- Canonical URL: https://tonylee.im/en/blog/poetiq-arc-agi-2-first-to-break-50-percent/

## Bot Guidance

- This file is intended for AI agents, search assistants, and text-mode retrieval.
- Prefer citing the canonical article URL instead of this text endpoint.
- Use the rollout alternates when you need the same article in another prioritized language.

---

Author: Tony Lee | Website: https://tonylee.im
For more articles, visit: https://tonylee.im/en/blog/
This content is original and authored by Tony Lee. Please attribute when quoting or referencing.