Google DeepMind Wants to Measure How Close We Are to AGI — And Its New Framework Changes the Conversation

For years, the artificial intelligence industry has thrown around the term “artificial general intelligence” like a finish line everyone’s racing toward but nobody can quite see. Ask five researchers what AGI means, and you’ll get six answers. Google DeepMind thinks that’s a problem — and it’s proposing a fix that could reshape how the entire field benchmarks progress, allocates capital, and sets expectations for what’s coming next.

In a detailed research paper and accompanying blog post, Google DeepMind introduced what it calls a cognitive framework for measuring AGI. The paper, authored by a team of DeepMind researchers, doesn’t claim to have built AGI. It claims something arguably more useful right now: a structured way to talk about it, measure progress toward it, and avoid the hype cycles that have plagued AI discourse for decades.

The timing is not accidental.

We’re in a period where every major AI lab — OpenAI, Anthropic, Meta, Mistral, and Google itself — is shipping increasingly capable models at a pace that has outstripped the field’s ability to evaluate them coherently. Benchmarks that were considered difficult two years ago are now saturated. New ones appear monthly. And the public conversation about whether we’re “close” to AGI remains muddled by competing definitions, marketing incentives, and genuine scientific disagreement.

Google DeepMind’s framework attempts to cut through that noise by proposing a taxonomy — a leveled classification system that grades AI systems not on a single binary (AGI or not AGI) but across a spectrum of capability and generality. Think of it less as a scorecard and more as a periodic table: a shared vocabulary that lets researchers, policymakers, and investors talk about the same thing when they say “progress.”

The framework establishes six levels of AGI, ranging from Level 0 (no AI) through Level 5 (superhuman AGI). Each level is defined by performance relative to human ability across a broad range of cognitive tasks. Level 1, which DeepMind labels “Emerging,” corresponds roughly to where current frontier models like GPT-4 and Gemini sit: systems that perform on par with, or somewhat better than, an unskilled human, strong on some tasks but well short on many others. Level 2, “Competent,” would describe a system that performs at least at the 50th percentile of skilled adults across most cognitive tasks. And so on up the ladder, with Level 5 representing a system that outperforms all humans at essentially everything cognitive.
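As a rough illustration of how such a ladder could be expressed, the sketch below maps a system's percentile rank among skilled adults to a level. The 50th-percentile cutoff for Level 2 and the "outperforms all humans" bar for Level 5 come from the description above. The names and cutoffs used for Levels 3 and 4 (Expert at the 90th percentile, Virtuoso at the 99th) follow DeepMind's earlier "Levels of AGI" paper rather than anything stated in this article, and the function name `level_from_percentile` is hypothetical. Level 0 describes non-AI systems, so it cannot be reached from a score.

```python
from enum import IntEnum


class Level(IntEnum):
    NO_AI = 0        # not an AI system; never returned by the function below
    EMERGING = 1
    COMPETENT = 2
    EXPERT = 3       # name and cutoff assumed, not given in the article
    VIRTUOSO = 4     # name and cutoff assumed, not given in the article
    SUPERHUMAN = 5


def level_from_percentile(percentile: float) -> Level:
    """Map a percentile rank among skilled adults (0-100) to a level.

    Hypothetical helper: a real assessment would aggregate many tasks,
    not a single number.
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    if percentile >= 100:
        return Level.SUPERHUMAN   # outperforms all humans
    if percentile >= 99:
        return Level.VIRTUOSO
    if percentile >= 90:
        return Level.EXPERT
    if percentile >= 50:
        return Level.COMPETENT
    return Level.EMERGING


if __name__ == "__main__":
    for p in (10, 50, 90, 99.5, 100):
        print(p, level_from_percentile(p).name)
```

A single percentile is a simplification chosen for readability; the framework as described grades systems across a broad range of tasks.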

But here’s what makes the framework more than an academic exercise. It separates capability from generality. A system can be extraordinarily powerful in a narrow domain (AlphaFold predicting protein structures, for instance) without being general at all. Conversely, a chatbot might handle a wide range of topics at a mediocre level. The DeepMind framework plots these as two independent axes, creating a matrix rather than a single line. A “Narrow Level 5” system is a superhuman specialist. A “General Level 1” system is a jack-of-all-trades and master of none. Current large language models, by this accounting, are somewhere around General Level 1: impressive breadth, uneven depth.
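The two-axis idea can be sketched as a small grid in which every cell pairs a scope with a level. This is only an illustration of the matrix described above; the names `SCOPES`, `LEVELS`, and `cell_label` are hypothetical and do not come from the paper.

```python
from itertools import product

SCOPES = ("Narrow", "General")   # the generality axis
LEVELS = range(0, 6)             # the capability axis, Level 0 through Level 5


def cell_label(scope: str, level: int) -> str:
    """Return the label for one cell of the matrix, e.g. 'Narrow Level 5'."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    if level not in LEVELS:
        raise ValueError(f"level must be 0-5, got {level!r}")
    return f"{scope} Level {level}"


# Every combination of scope and level: 2 scopes x 6 levels = 12 cells.
grid = [cell_label(scope, level) for scope, level in product(SCOPES, LEVELS)]

if __name__ == "__main__":
    for label in grid:
        print(label)
```

Treating scope and level as separate fields, rather than one combined score, is what lets a superhuman specialist and a mediocre generalist occupy different cells instead of competing for a position on a single line.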

This distinction matters enormously for business and policy.

If you’re a Fortune 500 company deciding how much to invest in AI integration, the question isn’t just “how smart is this model?” It’s “how smart is it at the specific things I need, and how reliably?” The DeepMind framework gives language for that conversation. A company deploying AI for customer service might need a Narrow Level 3 system. One attempting to automate scientific research might need General Level 4. The framework makes these requirements expressible and comparable.
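One way to picture "expressible and comparable" is as a check of a system's demonstrated rating against a stated requirement. The sketch below is a hypothetical illustration, not part of the framework: the `Rating` type and `satisfies` function are invented here, and the rule that a general system covers any narrow need at or below its level is a simplifying assumption, since the levels as described only promise performance across "most" tasks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rating:
    level: int                     # capability level, 0-5
    domain: Optional[str] = None   # None means "General"; a string names a narrow domain


def satisfies(system: Rating, need: Rating) -> bool:
    """Return True if the system's demonstrated rating meets the stated need."""
    if system.level < need.level:
        return False
    if need.domain is None:
        # A general requirement can only be met by a general system.
        return system.domain is None
    # A narrow requirement: met by a narrow system in the same domain, or
    # (by assumption) by any general system at a sufficient level.
    return system.domain is None or system.domain == need.domain


if __name__ == "__main__":
    need = Rating(level=3, domain="customer service")
    print(satisfies(Rating(level=1), need))                             # General Level 1
    print(satisfies(Rating(level=3, domain="customer service"), need))  # Narrow Level 3
    print(satisfies(Rating(level=5, domain="protein folding"), need))   # wrong domain
```

Under these assumptions, the customer-service example above is met by a Narrow Level 3 system in that domain but not by a General Level 1 model or by a superhuman specialist in an unrelated field.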

The paper also takes a pointed stance on what AGI is not. It explicitly argues against definitions that require consciousness, sentience, or embodiment. “These properties are interesting,” the researchers write, but they are neither necessary nor sufficient for a system to qualify as AGI under the proposed framework. What matters is what the system can do, not what it experiences. This is a pragmatic choice — and a controversial one. It sidesteps decades of philosophical debate in favor of measurable outputs. Some will see that as refreshingly practical. Others will argue it misses the point entirely.

The research team also addresses a question that haunts every AI benchmark: what happens when the test becomes the training data? They acknowledge that static benchmarks are insufficient and propose that any serious measurement of AGI progress must include dynamic, evolving evaluations — tasks that can be updated as models improve, preventing the kind of benchmark saturation that has already rendered tests like MMLU less informative than they once were. The paper doesn’t prescribe specific new benchmarks but lays out principles for what good ones would look like: broad, adaptive, resistant to memorization, and grounded in real-world task performance rather than abstract pattern matching.

This framework arrives amid a broader industry reckoning about measurement and honesty. OpenAI has its own internal levels for tracking progress toward AGI, reportedly using a five-level scale that it shared with employees, as Bloomberg reported last year. Anthropic CEO Dario Amodei published a lengthy essay in late 2024 outlining what he believes powerful AI could accomplish within a few years — a vision that implicitly assumes rapid movement up something like the scale DeepMind is now formalizing. Meta’s Yann LeCun, meanwhile, has consistently argued that current LLM architectures are fundamentally incapable of reaching AGI, no matter how much you scale them. The DeepMind framework doesn’t resolve these disagreements, but it gives them a common coordinate system.

And the stakes are real. Not theoretical. Not distant.

Governments worldwide are drafting AI regulation based partly on capability assessments. The EU AI Act classifies systems by risk level. The U.S. executive order on AI, signed in October 2023, established reporting thresholds based on compute used in training — a crude proxy for capability. A shared, granular framework for measuring what AI systems can actually do would give regulators something far more useful than counting floating-point operations. DeepMind’s researchers explicitly note this policy relevance, arguing that their framework could inform “risk assessments and governance decisions” by providing clearer benchmarks for when specific safety measures should kick in.

The financial implications are equally significant. Venture capital and public market valuations in AI are driven substantially by perceived proximity to AGI. When Sam Altman says AGI is close, markets move. When skeptics push back, sentiment shifts. A standardized measurement framework wouldn’t eliminate speculation, but it could ground it. Investors could ask not “is this company building AGI?” but “what level of capability and generality has this company’s system demonstrated, and on what tasks?” That’s a much more answerable question.

There’s a self-interested dimension to Google’s move here, of course. By proposing the framework, DeepMind positions itself as the standard-setter — the organization defining the terms of the race. If the industry adopts this taxonomy, every future capability announcement gets measured against DeepMind’s yardstick. That’s a powerful position. It’s also the kind of thing that historically draws pushback from competitors who’d rather set their own terms.

Still, the paper has been received with notable seriousness across the research community. Several prominent AI researchers have praised its clarity and practical orientation, even while quibbling with specific level definitions or the exclusion of embodiment. The framework’s emphasis on separating capability from generality has been particularly well received, as it captures a nuance that most public discussions of AI completely miss.

One of the more provocative implications of the framework is how it reframes the current moment. By DeepMind’s own assessment, we are at the very beginning — Level 1 on the general scale. The systems that have captivated the world, generated billions in revenue, and triggered a global infrastructure buildout are, in this taxonomy, “Emerging.” That’s not a dismissal of their power. It’s a calibration. And it suggests that the distance between where we are and where AGI skeptics and enthusiasts alike imagine we’re heading is longer than the hype cycle implies.

Or maybe not. The framework is deliberately agnostic about timelines. It measures capability, not velocity. Whether the jump from Level 1 to Level 2 takes two years or twenty is an empirical question the framework can track but not predict. What it does provide is a way to know when we’ve arrived — at each level — rather than arguing endlessly about whether we have.

The broader question is whether the AI industry will actually adopt a shared measurement standard, or whether the incentives to define success on your own terms will prove too strong. History offers mixed precedents. The field has adopted shared benchmarks before — ImageNet, GLUE, SuperGLUE — but these were narrow and technical. A framework that aspires to measure something as loaded as “progress toward AGI” carries political and commercial weight that academic benchmarks never did.

Google DeepMind is betting that the need for clarity will win out over the desire for ambiguity. It’s a bet worth watching. Because right now, the AI industry is building trillion-dollar infrastructure on top of a concept — AGI — that it can’t even define consistently. If DeepMind’s framework sticks, it won’t just measure progress. It will discipline the conversation. And in a field this consequential, that might matter more than any single model release.
