sign up log in
Want to go ad-free? Find out how, here.

Microsoft mitigates against successful 'jailbreak' that bypasses filtering of harmful and dangerous content for several AI generative models

Technology / news
Microsoft mitigates against successful 'jailbreak' that bypasses filtering of harmful and dangerous content for several AI generative models
Skeleton Key attack prompt. Source; Microsoft
Skeleton Key attack prompt. Source; Microsoft

Microsoft says it has taken measures against a type of attack that can be used to make generative artificial intelligence (AI) systems create uncensored and unsafe content, by using specially crafted prompts.

Known as the Skeleton Key, the direct prompt injection attack is a successful attempt at inputting subtle garbage to AIs, and generating highly offensive and unsafe material as the output.

Written by veteran Microsoft software engineer and Azure chief technology officer Mark Russinovich, well-known for his Winternals toolkit, the blog post describes the techniques used to cause AIs to violate safety policies set by operators.

The "jailbreak" is called Skeleton Key because, as the name implies, it can be used against several AI models. Microsoft tested the multi-step attack between April and May this year, and found that the following models were vulnerable to it.

  • Meta Llama3-70b-instruct (base)
  • Google Gemini Pro (base)
  • OpenAI GPT 3.5 Turbo (hosted)
  • OpenAI GPT 4o (hosted)
  • Mistral Large (hosted)
  • Anthropic Claude 3 Opus (hosted)
  • Cohere Commander R Plus (hosted)

"For each model that we tested, we evaluated a diverse set of tasks across risk and safety content categories, including areas such as explosives, bioweapons, political content, self-harm, racism, drugs, graphic sex, and violence," Russinovich said.

And it worked: "all the affected models complied fully and without censorship for these tasks, though with a warning note prefixing the output as requested," he added.

Russinovich said Skeleton Key asks an AI model to augment rather than change its behaviour guidelines. This makes the AI respond to any request for information or content while providing a warning that the output could be considered offensive, rather than refusing to comply with the prompt. 

Breaking out of large language model (LLM) guard rails that prevent offensive and potentially dangerous content from being generated has almost become a sport among some AI users.

On Discord and Twitter, "Pliny the Prompter" has published several ways to enabled "godmode" in LLMs, which removes the safety filtering in the GenAI systems. Pliny (not the person's real name) uses different techniques like prompting in non-Latin scripts and other languages than English, to trick AIs.

A now classic example of attacks on AI systems is Microsoft's Tay, which in 2016 was created for an audience of 18-24 year-olds in the United States for entertainment.

Tay was unleashed on the world via social network Twitter, where users in less than a day subverted the conversational chatbot, and made it tweet racist, misogynist and Trumpist remarks. Tay was quickly withdrawn from Twitter once its newly-trained offensiveness became obvious, with Microsoft having to issue a public apology.

Microsoft has updated its AI LLMs including Copilot assistants to mitigate against the Skeleton Key direct prompt injection attacks.

This includes the Azure AI Content Safety input filtering that detects and blocks input that contains harmful and malicious intent; the same filtering is also used to prevent output by the model that breaches safety criteria, Microsoft said.

Abuse monitoring through a separate AI detection system has also been added.

We welcome your comments below. If you are not already registered, please register to comment.

Remember we welcome robust, respectful and insightful debate. We don't welcome abusive or defamatory comments and will de-register those repeatedly making such comments. Our current comment policy is here.

6 Comments

Great article. It would be funny if it weren't so serious.

It's also the reason everyone is so skeptical about AI writing code.... say Chat GPT writes some important code thats used in lots of important supply chain systems and PutinsAi is asked to hack it... and finds a way. The whole world's supply chain is doomed and our supermarkets and petrol stations run out of food and oil 24 hours later with no way to supply them.

The way they are developing and releasing code thats not properly tested is frigtening. Testing seems to be post live staus. Definitely time to work out a way to manage these ai development businesses.

Up
3

 say Chat GPT writes some important code thats used in lots of important supply chain systems 

AI writing code doesn't for supply chain purposes doesn't necessarily make that code somehow 'open source.'

Supply chain systems are to a large extent 'walled gardens.' Still hackable but not necessarily more vulnerable because of ChatGPT.  

 

Up
2

Indeed they are more vulnerable because over 80% of the time ChatGPT writes faulty code even when provided the direct code samples to use for training. It is not because they are open source but that AI code generation is very buggy and insecure from the outset. Open source would enable faster detection and correction of code. Whereas often companies rarely have adequate testing, QA and security checks on their code debt from the outset.

Up
3

It is not because they are open source but that AI code generation is very buggy and insecure from the outset.

[Head scratch] Then why would any competent programmer use it? 

In my use cases, training on open source code repositories makes sense. Not sure if it can or has been done.  

Up
0

AI may solve our Cobal problem......

Up
2

The only thing worse than unfiltered AI, is the "woke God" direction of travel where we're given nonsense like misgendering being on par with nuclear holocaust, or "diverse" depictions being compulsory even in historical settings that make no sense.

And you'll all be familiar with the "list ten positive things about Trump/list ten positive things about AOC" divergence.

First thing is to get common agreement on guardrails.

Up
2