Anthropic CEO Dario Amodei has made a disturbing admission, saying his company doesn’t understand how the AI models it is developing actually work.
Anthropic has established itself as one of the more safety-conscious AI firms; it was founded by former OpenAI executives who believed the latter had drifted from its mission of safe AI development.
In a blog post on his personal site, Amodei makes the case that technology’s advancement cannot be stopped, but it can be steered.
People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology. For several years, we (both Anthropic and the field at large) have been trying to solve this problem, to create the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model. This goal has often felt very distant, but multiple recent breakthroughs have convinced me that we are now on the right track and have a real chance of success.
At the same time, the field of AI as a whole is further ahead than our efforts at interpretability, and is itself advancing very quickly. We therefore must move fast if we want interpretability to mature in time to matter. This post makes the case for interpretability: what it is, why AI will go better if we have it, and what all of us can do to help it win the race.
AI’s Unprecedented Challenge
Amodei then highlights what sets AI apart from other types of technology. When conventional software does something, it is a direct result of being programmed to do it; even its mistakes trace back to mistakes in its programming. In contrast, AI does many things its developers cannot fully explain, and they often do not understand how or why it behaves the way it does.
When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate. As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built—their internal mechanisms are “emergent” rather than directly designed. It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth, but the exact structure which emerges is unpredictable and difficult to understand or explain. Looking inside these systems, what we see are vast matrices of billions of numbers. These are somehow computing important cognitive tasks, but exactly how they do so isn’t obvious.
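To make that point concrete, the minimal sketch below (assuming PyTorch is installed; the toy transformer block is purely illustrative, not any production model) shows what “looking inside” a model actually yields: large arrays of learned numbers, none of which carry a human-readable label explaining what concept they encode.

```python
import torch.nn as nn

# A single toy transformer block -- hypothetical, purely for illustration.
model = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Even one small block contains millions of learned parameters.
total = sum(p.numel() for p in model.parameters())
print(f"parameters in one small block: {total:,}")

# Each parameter tensor is just a matrix of floats; inspecting the raw
# values tells us nothing about what the model has learned to do with them.
for name, p in list(model.named_parameters())[:3]:
    print(name, tuple(p.shape), p.flatten()[:4].tolist())
```

A frontier model is this, scaled up by several orders of magnitude: the numbers are all there to inspect, but their meaning is not.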
Amodei says this opacity is exactly what makes AI far riskier than previous technologies.
Many of the risks and worries associated with generative AI are ultimately consequences of this opacity, and would be much easier to address if the models were interpretable. For example, AI researchers often worry about misaligned systems that could take harmful actions not intended by their creators. Our inability to understand models’ internal mechanisms means that we cannot meaningfully predict such behaviors, and therefore struggle to rule them out; indeed, models do exhibit unexpected emergent behaviors, though none that have yet risen to major levels of concern. More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.
The exec does say that, despite the possibility of AI learning to deceive humans or seek power, there doesn’t appear to be any solid evidence of a model actually scheming to do either. He adds, however, that at least part of the reason no solid evidence has surfaced to date is precisely that there’s no effective way to know what AI models are doing, or how they’re doing it.
But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts. What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing. Honestly I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.
A Possible Solution
Amodei makes the case for “mechanistic interpretability,” the term for research that tries to understand how complex AI models work internally, especially how they arrive at their decisions.
Our long-run aspiration is to be able to look at a state-of-the-art model and essentially do a “brain scan”: a checkup that has a high probability of identifying a wide range of issues including tendencies to lie or deceive, power-seeking, flaws in jailbreaks, cognitive strengths and weaknesses of the model as a whole, and much more. This would then be used in tandem with the various techniques for training and aligning models, a bit like how a doctor might do an MRI to diagnose a disease, then prescribe a drug to treat it, then do another MRI to see how the treatment is progressing, and so on.
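In practice, interpretability work builds on primitives such as recording a model’s internal activations and asking what they represent. The sketch below (assuming PyTorch; the toy network and layer name are hypothetical stand-ins, not Anthropic’s tooling) shows that basic step of capturing activations with a forward hook.

```python
import torch
import torch.nn as nn

# A toy network standing in for a real model -- purely illustrative.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def save_activation(name):
    # Returns a hook that stores the layer's output each forward pass.
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Attach the hook to the hidden layer so its activations are recorded.
model[1].register_forward_hook(save_activation("hidden_relu"))

x = torch.randn(1, 16)
_ = model(x)

# Interpretability research asks what these recorded numbers represent,
# e.g. whether particular directions correspond to human-meaningful features.
print(captured["hidden_relu"].shape)  # torch.Size([1, 32])
```

Scaling this kind of inspection from a toy network to a frontier model, and turning raw activations into reliable diagnoses, is the gap the “MRI for AI” analogy describes.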
Despite the promise interpretability offers, Amodei is very frank about the challenges involved in implementing it, especially doing so in time to matter.
On one hand, recent progress—especially the results on circuits and on interpretability-based testing of models—has made me feel that we are on the verge of cracking interpretability in a big way. Although the task ahead of us is Herculean, I can see a realistic path towards interpretability being a sophisticated and reliable way to diagnose problems in even very advanced AI—a true “MRI for AI”. In fact, on its current trajectory I would bet strongly in favor of interpretability reaching this point within 5-10 years.
On the other hand, I worry that AI itself is advancing so quickly that we might not have even this much time. As I’ve written elsewhere, we could have AI systems equivalent to a “country of geniuses in a datacenter” as soon as 2026 or 2027. I am very concerned about deploying such systems without a better handle on interpretability. These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work.
Conclusion
Despite the challenges involved, it is reassuring to see the CEO of one of the top AI firms reaffirm his commitment to safe AI development, and in particular his belief that remaining ignorant of how these systems work is unacceptable.