A new mathematical proof asserts that large language models, the backbone of AI agents, cannot reliably execute complex computational tasks, casting doubt on promises of autonomous AI supremacy. Father-and-son researchers Varin Sikka of Stanford University and Vishal Sikka of Vianai Systems published “Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models” on arXiv in July 2025 (arXiv:2507.07505), arguing that self-attention mechanisms cap a model’s per-pass computation at O(N² · d), where N is the number of input tokens and d is the embedding dimension.
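To see where the quadratic figure comes from, a naive single-head self-attention pass can be written out directly. The NumPy sketch below is only an illustration of the N × N score matrix and its O(N² · d) cost, not code from the paper:

```python
# Minimal single-head self-attention (NumPy). The score matrix is N x N,
# and each of its entries costs O(d) to compute: hence O(N^2 * d) per pass.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (N, d) token embeddings; Wq, Wk, Wv: (d, d) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # projections: O(N * d^2)
    scores = Q @ K.T / np.sqrt(X.shape[1])           # (N, N) matrix: O(N^2 * d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V                               # mixing values: O(N^2 * d)

N, d = 8, 16
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (8, 16): one updated embedding per input token
```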
Their Theorem 1 states: “Given a prompt of length N, which includes a computational task within it of complexity O(n^k) or higher, where n < N, an LLM, or an LLM-based agent, will unavoidably hallucinate in its response.” The proof leans on the time hierarchy theorem of Hartmanis and Stearns to show that tasks such as token composition or matrix multiplication exceed the model’s computational bound, making errors or outright failure unavoidable.
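In rough terms (a restatement for clarity, not the paper’s notation), the argument pits the model’s fixed per-pass budget against the embedded task’s growth rate:

```latex
% Informal restatement of the complexity gap; c is a constant and k is an
% exponent above the quadratic bound of self-attention.
\[
  \underbrace{c\,N^{2}d}_{\text{work available per forward pass}}
  \;<\;
  \underbrace{n^{k}}_{\text{steps the task requires}}
  \qquad \text{for } k > 2 \text{ and } n \text{ large enough},\ n < N .
\]
```

Once the right-hand side outgrows the left, a fluent answer can still be emitted, but the computation behind it cannot have been carried out.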
Examples include listing all k-length strings over n tokens (O(n^k)) or traveling-salesman verification (O((n-1)!/2)), both far beyond the quadratic limit. Agents fare no better, since verifying a result often matches or exceeds the complexity of producing it.
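A back-of-the-envelope comparison makes the gap concrete. The numbers below are illustrative assumptions (a 4,096-token context and a 4,096-dimensional embedding), not figures from the paper:

```python
# Compare the transformer's rough quadratic budget with the cost of
# enumerating all k-length strings over n tokens (the paper's O(n^k) example).
N, d = 4096, 4096          # assumed context length and embedding dimension
budget = N * N * d         # ~ self-attention work in one forward pass

n, k = 50, 8               # modest task parameters, still combinatorial
strings = n ** k           # number of k-length strings over n tokens

print(f"attention budget ~ {budget:.2e}")   # ~ 6.87e+10
print(f"k-length strings ~ {strings:.2e}")  # ~ 3.91e+13, roughly 570x the budget
```

Even before any question of correctness, the model cannot perform enough elementary operations to touch every required output.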
Origins of the Proof
Gizmodo highlighted the paper on January 23, 2026, noting it “pours some cold water on the idea that agentic AI… will be the vehicle for achieving artificial general intelligence” (Gizmodo). WIRED amplified it the same day, with Vishal Sikka declaring, “There is no way they can be reliable” (WIRED). A former SAP CTO who studied under AI pioneer John McCarthy, Sikka now leads Vianai and stresses high-stakes risks; asked about forgoing AI for nuclear plants, he replied, “Exactly.”
The paper distinguishes pure LLMs from hybrids: “Our paper is saying that a pure LLM has this inherent limitation—but at the same time it is true that you can build components around LLMs that overcome those limitations.” Yet core operations remain bounded, even in reasoning models like o1, due to token limits and self-attention dominance.
Apple’s researchers echoed this in 2025, finding that frontier reasoning models collapse beyond certain puzzle complexities, with reasoning effort peaking and then dropping even when token budget remains: “We show that frontier LRMs face a complete accuracy collapse beyond certain complexities” (Apple Machine Learning Research).
Industry Pushback Emerges
Optimists counter with verification layers. Harmonic, backed by Robinhood’s Vlad Tenev, deploys Aristotle, which encodes outputs in the Lean formal language so mathematical results can be machine-checked, and tops reliability benchmarks. Cofounder Tudor Achim views hallucinations positively: “I think hallucinations are intrinsic to LLMs and also necessary for going beyond human intelligence.” Their focus is “mathematical superintelligence,” sidestepping unverifiable tasks like essays.
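For a sense of what “encoding outputs in Lean” buys, the toy theorem below either compiles, in which case the proof is machine-checked, or it does not compile at all; this is only an illustration of the workflow, not output from Harmonic’s Aristotle:

```lean
-- Toy Lean 4 example (not Aristotle's output): the statement is accepted
-- only if the supplied proof term actually establishes it.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

#check add_comm_example  -- add_comm_example (a b : Nat) : a + b = b + a
```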
At Davos, Google’s Demis Hassabis touted reductions in hallucinations. Sentient’s Himanshu Tyagi notes corporate hesitation: “The value has not been delivered.” OpenAI admits that “accuracy will never reach 100 percent.” Sikka concedes guardrails may enable progress, but purely autonomous agents still run into the mathematical wall.
Discussions on X reflect the buzz, with users sharing Gizmodo links and debating AGI timelines, though no expert rebuttals have surfaced yet.
Broader Echoes in Research
Prior work points the same way: a 2024 paper proved that hallucinations are inevitable for general-purpose solvers, since LLMs cannot learn all computable functions. Multi-agent debate improves math reasoning, per South China Agricultural University findings, but scaling runs into coordination overhead; Google’s agentic scaling law finds that once single-agent accuracy exceeds roughly 45 percent, adding agents often degrades the ensemble.
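The debate pattern those studies describe is straightforward to sketch. In the outline below, `ask` is a hypothetical stand-in for a real chat-model call; the point is the coordination overhead visible in the loop, where every round re-sends every peer’s answer to every agent:

```python
# Hypothetical multi-agent debate loop: each agent answers, then revises
# after seeing its peers' answers. `ask` is a placeholder for a real
# chat-model call (e.g., an HTTP request to an inference endpoint).
def ask(agent: str, prompt: str) -> str:
    raise NotImplementedError("wire this to an actual model API")

def debate(question: str, agents: list[str], rounds: int = 2) -> list[str]:
    answers = [ask(a, question) for a in agents]
    for _ in range(rounds):
        revised = []
        for i, agent in enumerate(agents):
            peers = "\n".join(ans for j, ans in enumerate(answers) if j != i)
            prompt = (f"{question}\n\nOther agents answered:\n{peers}\n\n"
                      "Critique them and give your final answer.")
            revised.append(ask(agent, prompt))   # context grows every round
        answers = revised
    return answers
```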
DeepMind’s AlphaProof achieved silver-medal-level performance at the International Mathematical Olympiad via formal verification, not a pure LLM. Caltech’s agent tackled potential Andrews-Curtis counterexamples over millions of steps, but required hybrid reinforcement learning. These hybrids bypass pure transformer limits, yet the Sikkas warn that verification inherits the same complexity woes.
The implications ripple outward: the agent hype of 2025, billed as the year of “agentic AI,” faces recalibration. Enterprises demand reliability for automation, and the mathematical proofs quantify the risks, urging hybrids over hype.
Paths Beyond the Barrier
Workarounds proliferate: tool integration (MathJS, Wolfram), formal languages (Lean), multi-agent critique. Emergence’s MathViz-E agents outperform raw LLMs on Common Core problems, 86 percent versus 64 percent, by delegating to solvers. Yet the Sikkas insist: “Verification of a task is often harder than the task itself.”
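Tool integration amounts to routing the computational sub-task out of the transformer entirely. A minimal sketch, using Python’s standard ast module as the stand-in solver (production systems call MathJS, Wolfram, or similar external engines instead):

```python
# Minimal tool-routing sketch: delegate an arithmetic sub-task to a
# deterministic evaluator instead of letting the LLM "compute" it in tokens.
import ast
import operator as op

_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul,
        ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg}

def safe_eval(expr: str) -> float:
    """Evaluate a pure arithmetic expression without exec/eval."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp):
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(safe_eval("3**7 - 12*(5+8)"))   # 2187 - 156 = 2031, computed exactly
```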
Vishal Sikka told WIRED that composite systems may evolve, but the transformer cores persist. Alan Kay, a Sikka ally, deems the math “well-posed but beside the point,” prioritizing societal shifts. As Harmonic’s Achim posits, verified niches will expand, but fully general AGI agents remain mathematically constrained.
Industry races toward hybrids, but the quadratic ceiling endures, reshaping agent bets from infinite scaling to engineered precision. For insiders, the proof demands scrutiny: test Llama-3.2 on a 20-city traveling-salesman instance within a 512-token budget; the theorem predicts failure.
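That closing test can be costed out before running anything. The snippet below simply counts the tours a 20-city instance admits against the stated 512-token budget; the model name and budget are the article’s example, not a benchmark result:

```python
# Why a 512-token answer cannot enumerate-and-compare tours for TSP with n = 20:
# the number of distinct tours dwarfs any plausible decoding budget.
from math import factorial

n, token_budget = 20, 512
tours = factorial(n - 1) // 2          # (n-1)!/2 distinct tours
print(f"{tours:.1e} tours vs {token_budget} tokens")   # 6.1e+16 vs 512
```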

