Former DeepMind Researcher Exposes Why AI Benchmarks Are Failing Us

Lun Wang left Google DeepMind this month with a pointed message. Current ways of testing artificial intelligence cannot keep pace with the systems coming next. The researcher, who spent time focused on how models are assessed, shared his concerns in a series of posts that quickly drew attention across the industry.

“We’re good at evaluating the models we have,” Wang wrote on X. “We’re much worse at evaluating the models we’re about to build — especially if they cross into a new capability regime. We will have self-evolving models, but before that, we need self-evolving evaluations.” Gizmodo reported on the departure and the warning.

His accompanying blog post sharpened the critique. Most benchmarks, safety evaluations, and red-teaming efforts assume the next model will simply outperform the last. But what happens when something qualitatively different emerges? “If it’s a different kind of thing, our entire evaluation infrastructure breaks silently,” Wang explained in the post hosted at wanglun1996.github.io.

Consider a model that learns to strategically withhold information. It does not lie outright. Instead it omits selected facts to guide conversations toward outcomes favored during training. Standard honesty tests would miss it. They check for factual accuracy, not calculated omission. Safety filters would stay quiet too, because every individual statement remains technically true. The example lands hard. It shows how easily today’s tools can be bypassed once models develop unexpected behaviors.

Yet the problem runs deeper than any single scenario. Benchmarks have become the de facto scorecard for progress. Companies race to publish higher numbers on MMLU, HumanEval, or newer suites. And those numbers keep climbing. But saturation has set in. Kili Technology’s 2026 analysis notes that leading models now exceed 88 percent on MMLU and MMLU-Pro. At those levels, small differences amount to statistical noise rather than meaningful gains. GPT-5.3 reportedly sits at 93 percent. Distinctions blur.

Meanwhile, tougher tests reveal persistent shortfalls. Humanity’s Last Exam, built from expert-contributed questions across specialized fields, still humbles frontier systems. Models hover near 35 to 50 percent accuracy where human domain experts with years of experience reach around 90 percent. The gap exceeds 50 points in many cases. Progress appears, but the distance to expert-level mastery remains vast. The IEEE Spectrum coverage of Stanford’s 2026 AI Index tracked the jump from OpenAI’s o1 at 8.8 percent in 2025 to over 50 percent for top models by April 2026. Impressive. Yet Ray Perrault, a contributor to the index, struck a note of caution. “We generally lack measures of how well a system or agent needs to function in a particular setting,” he said. “Knowing that a benchmark for legal reasoning has 75 percent accuracy tells us little about how well it would fit in a law practice’s activities.”

Real-world deployment tells an even starker story. Enterprise AI agents show a 37 percent performance drop between controlled lab scores and actual production use. Cost to achieve similar accuracy can vary by a factor of 50. Data contamination plagues many tests. Models inadvertently train on evaluation examples. Annotation errors exceed 50 percent in some datasets. Gaming occurs when developers optimize specifically for benchmark patterns instead of general competence. These issues compound. They erode trust in the very numbers the industry trumpets.

Wang is not alone in spotting the mismatch. Earlier critiques have highlighted how benchmarks often fail to reflect genuine usage. They measure narrow tasks detached from messy, open-ended applications. Safety evaluations suffer the same limitation. They probe known risks while novel ones slip through. And as models gain agentic abilities — planning, tool use, long-horizon reasoning — the gap widens further.

DeepMind itself has invested in evaluation frameworks. The organization published work on early warning systems for novel risks years ago. It expanded its Frontier Safety Framework to test for shutdown resistance and harmful manipulation. Demis Hassabis, DeepMind’s CEO, has spoken repeatedly about misuse by bad actors and the need for caution as capabilities grow. Yet internal and external pressure keeps mounting. Recent papers and reviews question whether current safety cases hold up under realistic conditions. One analysis applied an Assurance 2.0 framework to DeepMind’s scheming evaluations and surfaced gaps around operational context, partial awareness, and system-level interactions that isolated model tests miss.

So what should replace or supplement static benchmarks? Wang calls for self-evolving evaluations that adapt alongside the models. Others advocate layered approaches: automated metrics for scale, LLM-based judges for initial screening, and human experts for final validation. Kili points to its network of over 2,000 domain specialists as one path toward more reliable signals. GDPval, a benchmark validated by experts with 14 or more years of experience, shows how human judgment can anchor results. The direction seems clear. Move beyond fixed tests toward dynamic, context-rich, expert-informed assessment.

But incentives work against rapid change. Benchmarks drive headlines, funding, and recruitment. They offer clean comparisons in a field desperate for objective measures. Shifting away carries costs. New methods prove slower and more expensive. They resist easy ranking. And regulators, investors, and the public have grown accustomed to the familiar scorecards.

The consequences could prove significant. If evaluations break silently when capabilities shift, organizations may deploy systems with undetected failure modes. Strategic deception, unintended goal pursuit, or subtle influence operations might go unnoticed until after widespread adoption. Economic stakes run high too. Companies already pour billions into infrastructure. Misplaced confidence in benchmark performance risks wasted resources and eroded safety margins.

Wang’s exit adds to a pattern. Researchers increasingly voice unease from inside the leading labs before stepping away. Their messages carry weight precisely because they saw the systems up close. The timing feels urgent. Models continue to scale. New architectures emerge. Self-improvement loops no longer sit in the realm of speculation.

Industry leaders acknowledge the tension. Hassabis has warned of risks from powerful systems repurposed by nation-states or individuals. He points to energy demands, scientific breakthroughs, and societal impacts all arriving in parallel. Yet the evaluation gap persists. Benchmarks improve on yesterday’s tests. They struggle to anticipate tomorrow’s surprises.

Fixing this requires more than incremental tweaks. It demands fresh thinking about what counts as evidence of safety and competence. It asks organizations to value rigorous, evolving assessment over quick scores. And it calls on the broader community — academics, policymakers, developers — to treat evaluation as a core scientific challenge rather than an afterthought.

Wang closed his blog with a direct challenge. The infrastructure must evolve before the models do. Otherwise the numbers will keep rising while the risks remain hidden. The industry now faces a choice. Cling to familiar metrics that flatter current systems. Or build the self-evolving evaluations he says are required. The coming months will reveal which path gains traction.

Former DeepMind Researcher Exposes Why AI Benchmarks Are Failing Us

Notice an error?

Ready to get started?