Anthropic Launched a Bug Bounty Program for AI Safety Defense

Anthropic is continuing its efforts to prioritize AI safety, launching a bug bounty program for its safety defenses.

Anthropic is one of the leading AI firms, founded by former OpenAI execs who were concerned the company wasn’t doing enough to prioritize safety. Throughout its development, Anthropic has continued to emphasize safety, launching a number of initiatives designed to better understand AI and safeguard against the potential dangers AI poses.

Illustrating the challenges that exist, Anthropic CEO Dario Amodei recently admitted that the company—and everyone else for that matter—doesn’t understand how it works.

People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology. For several years, we (both Anthropic and the field at large) have been trying to solve this problem, to create the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model. This goal has often felt very distant, but multiple recent breakthroughs have convinced me that we are now on the right track and have a real chance of success.

When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate. As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built—their internal mechanisms are “emergent” rather than directly designed. It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth1, but the exact structure which emerges is unpredictable and difficult to understand or explain. Looking inside these systems, what we see are vast matrices of billions of numbers. These are somehow computing important cognitive tasks, but exactly how they do so isn’t obvious.

In response to the concerns about AI, and the potential risks it poses, Anthropic launched a bug bounty program in the hopes of identifying issues with its safety defenses.

Today, we’re launching a new bug bounty program to stress-test our latest safety measures. Similar to the program we announced last summer, we’re challenging researchers to find universal jailbreaks in safety classifiers that we haven’t yet deployed publicly. These safeguards are part of the advanced protections we’ve developed to help us meet the AI Safety Level-3 (ASL-3) Deployment Standard as part of our Responsible Scaling Policy, the framework that governs how we develop and deploy increasingly capable AI models safely.

The bug bounty program, which is in partnership with HackerOne, will test an updated version of our Constitutional Classifiers system. Constitutional Classifiers are a technique we built to guard against jailbreaks that could elicit information related to CBRN (chemical, biological, radiological, and nuclear) weapons. This system follows a list of principles that define what type of content should and shouldn’t be allowed when interacting with Claude, and focus narrowly on specific harms.

The company says the bounty rewards will be “up to $25,000 for verified universal jailbreaks found on the unreleased system.”

The company says its bug bounty program will help it stress-test its ASL-3 safeguards for advanced AI, leading to more safe and secure AI deployments.

Anthropic Launched a Bug Bounty Program for AI Safety Defense

Notice an error?

Ready to get started?