In a striking demonstration of ingenuity over brute force, Poetiq, a six-person startup founded by former Google DeepMind researchers, has topped the ARC-AGI-2 benchmark, surpassing efforts from Google and Anthropic while spending just $40,000 on hardware. The company emerged from stealth with a $45.8 million seed round, signaling investor confidence in its meta-system approach that enhances existing large language models without retraining.
Launched in June 2025 by co-CEOs Shumeet Baluja and Ian Fischer, Poetiq leverages recursive self-improvement to generate specialized “expert agents” for complex tasks. Clients supply a problem and a few hundred examples, far fewer than the thousands needed for traditional fine-tuning. Poetiq’s layer sits atop models such as OpenAI’s GPT, Google’s Gemini, Anthropic’s Claude, and Meta’s Llama, optimizing for accuracy and efficiency. Puck News detailed how the team achieved one of the highest scores on ARC-AGI-2 within six months of launching.
ARC-AGI-2, the successor to the Abstraction and Reasoning Corpus that François Chollet introduced in 2019, tests abstract reasoning and generalization, skills where LLMs traditionally falter. Poetiq’s system hit 54% accuracy on the semi-private set using Gemini 3 Pro, beating Gemini 3 Deep Think’s 45% at less than half the cost per task ($30.57 vs. $77.16), as verified by ARC Prize. Later, integrating GPT-5.2 X-High pushed public eval accuracy to 75%, exceeding prior records by 16 points at $8 per task. ARC Prize confirmed these refinements redraw performance frontiers.
Recursive Self-Improvement Unlocks Hidden Potential
“LLMs are impressive databases that encode a vast amount of humanity’s collective knowledge. They are simply not the best tools for deep reasoning,” Baluja told Pulse 2.0. Poetiq’s meta-system runs iterative loops: generate candidate solutions, critique them, refine, and verify. Self-auditing lets it halt once a solution passes its own checks, averaging fewer than two model requests per ARC problem. That restraint avoids wasteful compute, in contrast to the heavy demands of reinforcement learning.
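Poetiq has not published the internals of this loop beyond the configurations noted below, but the pattern it describes is easy to picture. The following Python sketch is a hypothetical illustration, not Poetiq’s code: the llm callable stands in for any provider’s text-in/text-out API, the prompts are placeholders, and the early-exit check is one simple way a self-auditing loop could decide it has reached a good enough answer.

```python
from typing import Callable

# Hypothetical sketch, not Poetiq's code: a generate-critique-refine-verify
# loop with a self-auditing stopping rule. `llm` is any text-in/text-out
# model call (Gemini, GPT, Claude, Llama, ...).

def solve(task: str, llm: Callable[[str], str], max_rounds: int = 2) -> str:
    # Generate an initial candidate solution.
    solution = llm(f"Solve this task:\n{task}")
    for _ in range(max_rounds):
        # Self-audit: the model critiques its own answer.
        critique = llm(
            f"Task:\n{task}\nProposed solution:\n{solution}\n"
            "List any errors, or reply PASS if the solution is correct."
        )
        if critique.strip().upper() == "PASS":
            break  # Halt at the first verified answer instead of retrying blindly.
        # Refine using the critique, then audit again on the next pass.
        solution = llm(
            f"Task:\n{task}\nPrevious attempt:\n{solution}\n"
            f"Critique:\n{critique}\nProduce a corrected solution."
        )
    return solution
```

The design point is the stopping rule: rather than burning a fixed budget of retries, the loop halts as soon as its own critique finds nothing left to fix.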
Poetiq’s open-sourced GitHub repo lets others reproduce its configurations, including pure Gemini-based setups. On ARC-AGI-1 public evals, the system outperformed baselines across cost-performance curves, and ARC Prize noted similar gains when it ran on Claude Opus 4.5, though at higher cost. Poetiq’s adaptability showed after GPT-5.2’s release: the team integrated the new model within hours and set new highs. OpenAI’s Greg Brockman tweeted recognition of the system exceeding human baselines, per PR Newswire.
Founded after Baluja and Fischer spent a decade at DeepMind, Poetiq fields a team with 53 years of combined experience. Garry Tan, Y Combinator CEO, praised the feat: “Getting to the top of ARC-AGI is no small feat, and recursive improvement a powerful milestone.” A NeurIPS talk with Fischer explored ensembles, voting, and system optimization without touching model weights, as Y Combinator’s X post highlighted.
Massive Seed Backs Frugal Innovation
The $45.8 million seed, co-led by FYRFLY Venture Partners and Surface Ventures, included Y Combinator, 468 Capital, Operator Collective, Hico Ventures, and Neuron Venture Partners. “That Poetiq managed to top ARC-AGI within six months of launching is remarkable,” said Philipp Stauffer of FYRFLY. Gyan Kapur of Surface added, “Poetiq doesn’t need to outcompete frontier models… it enhances any combination of LLMs.” VentureBurn covered the round.
Allison Barr Allen of Operator Collective echoed the excitement on X: “They have raised a $45.8M seed round after beating industry-leading benchmarks with a small team of 6.” Poetiq’s Miami headquarters and business/productivity software focus, per PitchBook, position it for enterprise customers. In contrast to GPU-heavy rivals, its roughly $40,000 hardware bill underscores the efficiency of its approach. Barr Allen’s X post celebrated the partnership.
Investors see enterprise potential in reasoning boosts for claims triage, fraud detection, and customer support. MIT’s Project NANDA found that 95% of GenAI pilots show no P&L impact, largely due to reliability issues, a gap Poetiq targets. ARC Prize’s 2025 report emphasized refinements like Poetiq’s as key and predicted their integration into commercial APIs.
Benchmark Breakthrough Signals Paradigm Shift
ARC-AGI-2 progress has accelerated, from sub-5% scores in early 2025 to Poetiq’s 54%. Humans average 60% on the benchmark, a level Poetiq neared or surpassed on some subsets. Reddit threads on r/singularity hailed the result as breaking 50%, though commenters debated benchmark-overfitting risks. ARC Prize stressed that its private sets guard against overfitting and verified Poetiq’s state-of-the-art result on the semi-private set.
Poetiq’s blog detailed Pareto-frontier shifts on both ARC-AGI-1 and ARC-AGI-2, using diverse tasks to drive self-improvement and to handle noise and uncertainty in reasoning. The company’s site confirms the verified results and teases additional benchmarks. The Rundown called it a shift toward application-layer gains over raw scale.
Beyond ARC, Poetiq eyes retrieval and reasoning tasks. Harj Taggar tweeted: “Poetiq just crushed the ARC A.G.I. benchmark, beating Anthropic and Google, with only six people.” Techmeme amplified Puck’s scoop on the frugal win. As buzz grows on X, Poetiq is making the case that small teams can lead through smart orchestration.
Enterprise Edge and AGI Path Ahead
For businesses, Poetiq slashes costs, running at less than half the per-task price of Gemini 3 Deep Think while integrating with any LLM stack. It automates prompt engineering, a focus of the NeurIPS talk. ARC Prize’s technical report lauded domain-specific harnesses evolving toward general-purpose systems via DSPy-like methods. Poetiq’s model-agnostic design helps future-proof it against the frontier-lab race.
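To make “model-agnostic” and “automated prompt engineering” concrete, here is a hypothetical sketch under assumed interfaces: best_prompt, the template strings, and the exact-match scoring are illustrative placeholders rather than Poetiq’s harness, and the llm argument can be bound to any backend (Gemini, GPT, Claude, Llama).

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical sketch of automated prompt engineering over a model-agnostic
# interface; not Poetiq's harness. Template strings and exact-match scoring
# are illustrative placeholders.

def best_prompt(llm: Callable[[str], str],
                templates: Iterable[str],
                dev_set: List[Tuple[str, str]]) -> str:
    """Return the prompt template with the highest exact-match accuracy
    on a small labeled dev set."""
    def accuracy(template: str) -> float:
        hits = sum(
            llm(template.format(task=task)).strip() == answer
            for task, answer in dev_set
        )
        return hits / len(dev_set)
    return max(templates, key=accuracy)

# Usage sketch: bind `llm` to any backend's completion call and pass a
# handful of prompt variants plus the client's labeled examples.
# templates = ["Solve step by step:\n{task}", "Answer concisely:\n{task}"]
# chosen = best_prompt(call_gemini, templates, examples)
```

A client’s few hundred labeled examples, as described earlier, would play the role of dev_set here, with the harness free to swap backends as new models ship.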
“We used recursive self-improvement to produce specialized agents in a matter of hours,” Baluja noted, contrasting the approach with reinforcement learning’s slower turnaround. Grishin Robotics has highlighted integration as a common point of enterprise failure, a gap Poetiq aims to bridge. With the new funding, expansion targets AI product teams and researchers who need reliability.
Critics question whether the approach transfers beyond ARC, but Poetiq’s multi-benchmark work and open code invite scrutiny. As Tan said, “You don’t always need a bigger model.” Poetiq’s rise challenges the scale-alone dogma, betting on meta-systems as a path to what its bio boldly calls safe superintelligence.

