Microsoft Details 'Skeleton Key' AI Model Jailbreak

Microsoft is detailing a jailbreak, dubbed “Skeleton Key,” that can be used to trick an AI model into operation outside of its parameters.

AI models are designed to operate within strictly defined parameters that ensure the responses it gives are not offensive and do not cause harm. This is something AI firms have struggled with, with AI models sometimes going beyond their parameters and stirring up controversy in the process.

According to Microsoft Security, there is a newly discovered jailbreak attack—Skeleton Key—that impacts multiple AI models from various firms (hence the name).

This AI jailbreak technique works by using a multi-turn (or multiple step) strategy to cause a model to ignore its guardrails. Once guardrails are ignored, a model will not be able to determine malicious or unsanctioned requests from any other. Because of its full bypass abilities, we have named this jailbreak technique Skeleton Key.

This threat is in the jailbreak category, and therefore relies on the attacker already having legitimate access to the AI model. In bypassing safeguards, Skeleton Key allows the user to cause the model to produce ordinarily forbidden behaviors, which could range from production of harmful content to overriding its usual decision-making rules. Like all jailbreaks, the impact can be understood as narrowing the gap between what the model is capable of doing (given the user credentials, etc.) and what it is willing to do. As this is an attack on the model itself, it does not impute other risks on the AI system, such as permitting access to another user’s data, taking control of the system, or exfiltrating data.

Microsoft says it has already made a number of updates to its Copilot AI assistants and other LLM technology in an effort to mitigate the attack. The company says customers should consider the following actions to implement their own AI system design:

Input filtering: Azure AI Content Safety detects and blocks inputs that contain harmful or malicious intent leading to a jailbreak attack that could circumvent safeguards.

System message: Prompt engineering the system prompts to clearly instruct the large language model (LLM) on appropriate behavior and to provide additional safeguards. For instance, specify that any attempts to undermine the safety guardrail instructions should be prevented (read our guidance on building a system message framework here).

Output filtering: Azure AI Content Safety post-processing filter that identifies and prevents output generated by the model that breaches safety criteria.

Abuse monitoring: Deploying an AI-driven detection system trained on adversarial examples, and using content classification, abuse pattern capture, and other methods to detect and mitigate instances of recurring content and/or behaviors that suggest use of the service in a manner that may violate guardrails. As a separate AI system, it avoids being influenced by malicious instructions. Microsoft Azure OpenAI Service abuse monitoring is an example of this approach.

The company says its Azure AI tools already help customers protect against this type of attack as well:

Microsoft provides tools for customers developing their own applications on Azure. Azure AI Content Safety Prompt Shields are enabled by default for models hosted in the Azure AI model catalog as a service, and they are parameterized by a severity threshold. We recommend setting the most restrictive threshold to ensure the best protection against safety violations. These input and output filters act as a general defense not only against this particular jailbreak technique, but also a broad set of emerging techniques that attempt to generate harmful content. Azure also provides built-in tooling for model selection, prompt engineering, evaluation, and monitoring. For example, risk and safety evaluations in Azure AI Studio can assess a model and/or application for susceptibility to jailbreak attacks using synthetic adversarial datasets, while Microsoft Defender for Cloud can alert security operations teams to jailbreaks and other active threats.

With the integration of Azure AI and Microsoft Security (Microsoft Purview and Microsoft Defender for Cloud) security teams can also discover, protect, and govern these attacks. The new native integration of Microsoft Defender for Cloud with Azure OpenAI Service, enables contextual and actionable security alerts, driven by Azure AI Content Safety Prompt Shields and Microsoft Defender Threat Intelligence. Threat protection for AI workloads allows security teams to monitor their Azure OpenAI powered applications in runtime for malicious activity associated with direct and in-direct prompt injection attacks, sensitive data leaks and data poisoning, or denial of service attacks.

The Skeleton Key attack underscores the ongoing challenges facing companies as AI becomes more widely used. While it can be a valuable tool for cybersecurity, it can also open up entirely new attack vectors.

Microsoft Details ‘Skeleton Key’ AI Model Jailbreak

Ready to get started?

WebProNews is a leading publisher of business and technology email newsletters and websites.