In the rapidly evolving field of artificial intelligence, large reasoning models (LRMs) are pushing boundaries by simulating human-like thought processes through extended chains of thought (CoT). But a critical gap has emerged: how efficiently do these models “think”? A recent analysis from Nous Research highlights this oversight, arguing that while accuracy benchmarks abound, metrics for thinking efficiency remain conspicuously absent, potentially inflating costs and hindering real-world deployment.
The report details how LRMs, trained via reinforcement learning to generate lengthy CoT during inference, often outperform traditional large language models on complex tasks. Yet this comes at a steep price: excessive token generation that bloats computational demands. Nous Research’s experiments across models like OpenAI’s o1, Anthropic’s Claude 3.5 Sonnet, and open-source alternatives reveal stark disparities: open models output 1.5 to 4 times more tokens than closed ones on identical tasks, with the gap spiking to as much as 10 times on simpler problems.
Efficiency Metrics: A New Imperative
This inefficiency, dubbed “overthinking,” sees models churning out redundant reasoning steps, especially on easy questions. Drawing on a related arXiv study in which researchers introduced Think-Bench to quantify such waste, Nous Research extends the conversation by measuring token usage per task type. For instance, on math problems, closed models like o1-mini averaged 200-300 tokens, while open variants like DeepSeek-R1 ballooned to over 1,000, underscoring a trade-off between accessibility and resource drain.
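To make that kind of per-task accounting concrete, the bookkeeping amounts to averaging completion tokens per model and task type. The sketch below is illustrative only: the record values are hypothetical placeholders, not Nous Research’s published numbers, and a real evaluation would pull token counts from each API’s usage metadata.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-response records: (model, task_type, completion_tokens).
# The numbers are placeholders, not Nous Research's published data.
records = [
    ("o1-mini", "math", 260),
    ("o1-mini", "math", 240),
    ("deepseek-r1", "math", 1150),
    ("deepseek-r1", "math", 980),
]

per_model_task = defaultdict(list)
for model, task_type, tokens in records:
    per_model_task[(model, task_type)].append(tokens)

# Report mean completion tokens for each (model, task type) pair.
for (model, task_type), counts in sorted(per_model_task.items()):
    print(f"{model:12s} {task_type:6s} mean tokens: {mean(counts):.0f}")
```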
Industry insiders note that this isn’t just academic; it impacts scalability. As per a post on X from Artificial Analysis, reasoning models can consume up to 20 times more tokens than non-reasoning counterparts, driving up inference costs in production environments. Nous Research’s data aligns, showing closed models’ tighter CoT leading to 2-3x faster response times, a boon for applications in finance or healthcare where latency matters.
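For a sense of scale, the cost impact is simple arithmetic on output tokens. The figures below are hypothetical placeholders (request volume, token counts, and per-token pricing are all assumptions), but they show how a 20x token multiplier flows straight through to the bill.

```python
def monthly_output_cost(requests_per_day, avg_output_tokens, usd_per_million_tokens):
    """Back-of-envelope output-token cost; all inputs are hypothetical."""
    tokens_per_month = requests_per_day * 30 * avg_output_tokens
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

# Illustrative only: same workload and per-token price, 20x more output tokens.
base = monthly_output_cost(10_000, 300, usd_per_million_tokens=10.0)
reasoning = monthly_output_cost(10_000, 300 * 20, usd_per_million_tokens=10.0)
print(f"non-reasoning: ${base:,.0f}/mo  reasoning: ${reasoning:,.0f}/mo")
```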
Bridging System 1 and System 2 Thinking
The benchmark gap ties into broader debates on dual-process theory in AI. A literature review in The Moonlight discusses S1-Bench, which evaluates intuitive (System 1) versus analytical (System 2) thinking in LRMs. Models excelling in deliberate reasoning often falter on quick heuristics, leading to overthinking. Nous Research proposes efficiency as a core metric, suggesting benchmarks that score not just accuracy but tokens per correct answer, normalized by task complexity.
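One way to operationalize that proposal is a score along the following lines. This is a hedged sketch of a possible formulation, not a metric Nous Research has published; the function, its inputs, and the per-task complexity weights are assumptions for illustration.

```python
def efficiency_score(results, complexity):
    """Tokens spent per correct answer, normalized by task complexity.
    Lower is better. The formulation and weights are illustrative assumptions."""
    weighted_tokens = 0.0
    correct = 0
    for r in results:
        # Harder tasks get a larger weight, so long CoT there is penalized less.
        weighted_tokens += r["tokens"] / complexity[r["task"]]
        correct += int(r["correct"])
    return float("inf") if correct == 0 else weighted_tokens / correct

# Toy usage with made-up numbers.
runs = [
    {"task": "gsm8k", "tokens": 280, "correct": True},
    {"task": "aime", "tokens": 2400, "correct": True},
    {"task": "gsm8k", "tokens": 900, "correct": False},
]
weights = {"gsm8k": 1.0, "aime": 4.0}
print(efficiency_score(runs, weights))  # lower score = more token-efficient
```

Lower scores favor models that reach correct answers with fewer tokens, while the complexity weights keep hard benchmarks from being penalized simply for requiring longer chains.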
Recent news amplifies this urgency. An Apple study covered in The Decoder found scaling limitations in models like Claude 3.7, where increased difficulty prompts less thinking, not more. Similarly, MarkTechPost’s February 2025 article on “Thinking Harder, Not Longer” argues for metrics prioritizing depth over length, echoing Nous Research’s call for standardized efficiency evaluations.
Industry Implications and Future Directions
For developers, this means rethinking training paradigms. X posts from researchers like Zhaopeng Tu highlight “underthinking” in o1-like models, where promising reasoning paths are abandoned prematurely. Nous Research’s findings suggest hybrid approaches—combining efficient CoT with pruning techniques—could cut token waste by 50%, based on their cross-model comparisons.
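As a toy illustration of what pruning could look like at inference time, the heuristic below stops decoding once an answer marker has appeared and the output has settled, or once a hard budget is hit. It is not Nous Research’s method; the marker patterns and budgets are invented for illustration.

```python
import re

# Toy heuristic, not Nous Research's method: markers that often signal the
# model has committed to a final answer (patterns are assumptions).
ANSWER_MARKER = re.compile(r"\\boxed\{|final answer", re.IGNORECASE)

def prune_cot(chunks, max_chars=4000, settle_chars=400):
    """Accumulate streamed text, stopping early once `settle_chars` characters
    have streamed past the first answer marker, or once a hard character
    budget is exhausted."""
    buf = ""
    marker_end = None
    for chunk in chunks:
        buf += chunk
        match = ANSWER_MARKER.search(buf)
        if match and marker_end is None:
            marker_end = match.end()
        if len(buf) >= max_chars:
            break  # hard budget reached
        if marker_end is not None and len(buf) - marker_end >= settle_chars:
            break  # answer appears settled; stop consuming further CoT
    return buf
```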
Looking ahead, benchmarks like those in GitHub’s Awesome-Reasoning-Foundation-Models repository are evolving to include efficiency. A July 2025 ranking in Labellerr places models like Qwen3-235B as efficiency leaders among open sources, but Nous Research warns that without a unified metric, progress stalls. As AI integrates deeper into enterprise, measuring thinking efficiency isn’t optional—it’s essential for sustainable innovation.
Open vs. Closed Models: A Token Tug-of-War
Delving deeper, Nous Research’s analysis exposes open models’ verbosity as a double-edged sword: greater transparency but higher overhead. On coding tasks, for example, models like Grok-3 generated 4x more tokens than o1, per their data, yet accuracy gains were marginal. This resonates with an arXiv survey on evaluating reasoning behavior beyond accuracy, emphasizing procedural efficiency.
Commentary on X, including Nous Research’s own announcements, stresses task-dependent variance of up to 10x on knowledge-intensive queries. Combined with insights from Dev Community’s guide on Qwen3, which touts it as a benchmark for open-source thinking, the likely path forward involves RL fine-tuning to dynamically allocate computation, as proposed in a February 2025 paper shared on the platform.
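A minimal sketch of what such dynamic allocation could reward during RL fine-tuning is shown below; the shaping is an assumption, not the cited paper’s formulation. The idea is simply to pay for correctness while discounting the reward as more of a token budget is consumed, so the policy learns to stop reasoning once it has enough.

```python
def length_aware_reward(is_correct, tokens_used, token_budget=1024, penalty=0.5):
    """Hedged sketch of a length-aware RL reward (not the cited paper's exact
    form). Correct answers earn full reward minus a penalty that grows with
    the fraction of the token budget consumed; incorrect answers earn nothing,
    so the model is never rewarded for being merely brief."""
    if not is_correct:
        return 0.0
    used_fraction = min(tokens_used / token_budget, 1.0)
    return 1.0 - penalty * used_fraction

# A concise correct answer scores higher than a verbose correct one.
print(length_aware_reward(True, 200))    # ~0.90
print(length_aware_reward(True, 2000))   # 0.50
print(length_aware_reward(False, 100))   # 0.0
```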
In sum, as LRMs mature, efficiency benchmarks could redefine success, ensuring models think smarter, not just longer.