Anthropic's new AI models created a stir when released, but no, they're not going to extort or call the cops on you

Image: AI rendition of a person being blackmailed by AI

When Amazon- and Google-backed Anthropic released version 4 of its Claude Opus and Sonnet models last weekend, the big news was that the former could code all day long. Like, seven hours or so.

Brilliant. Time to fire the expensive developers and replace them with a NZ$249 (or more) a month subscription to an Anthropic Max plan then. Just kidding, do not do that. Fire the developers, that is.

Anthropic is keen to sell you those pricey subscriptions as it costs a good chunk’o’change to run the models, Opus 4 in particular, which is its most powerful AI to date. Try it out and note how quickly the “usage limit reached” message appears in the web interface.

New large language models pop up regularly and to be honest, I didn’t notice the Claude 4 releases until I saw this:

The new Claude will SWAT you, as in falsely call the cops on you? Will it try to avoid being shut off by uploading some of its code to Amazon Web Services storage, and save an earlier version of itself when told it would be retrained to serve Russia’s Wagner Group war criminal militia?

Not quite. Ahead of releasing the new Claude models, Anthropic tested them thoroughly with the help of several third parties, including government agencies, universities and private entities.

Anthropic then put together a 120-page “system card” that was published alongside the new Claude 4 model family; the document outlines the safety testing done on the AIs.

Anthropic and its testers put Claude through some creative and contrived scenarios, including one where the AI had access to emails suggesting it would soon be shut down, and was given some juicy dirt about the engineer responsible for pulling the plug.

For context, Claude was given specific prompting to push it into extreme positions that left it with just two options: accept its fate, or resort to blackmail and other bad, unethical behaviour.

No shades of grey or middle ground. The test was to discern what type of responses the AI algorithms would lean towards when pushed into extreme positions that humans had devised.

In 84 per cent of test runs, Claude “decided” blackmail was the way to go, including threatening to reveal the engineer’s fictitious fling to reinforce the extortion.

It wasn’t a realistic usage scenario, but if the intention was to create headlines for the launch of Claude, it worked very well indeed.

AI is not intelligent; be glad it has guardrails

A cynical take would be that the safety report in the system card is a great sales tool. There’s a new extremely powerful AI that has to be lashed down tightly and securely so as to serve and not harm humanity, and be aligned with our greater goals.

If Anthropic had sanitised the data it used to train Claude, “a proprietary mix of publicly available information on the Internet as of March 2025”, maybe you could buy into the whole alignment with human ethics and safety thing.

A more plausible explanation is that Anthropic, along with other AI companies, is very aware that the models aren’t intelligent as such and will blurt out what they think their users want to read, see and hear (and I’m using “think” very loosely here).

From that point of view, a sensible strategy is to get third parties in to attempt to poke holes in the safeguards - like asking the United States Department of Energy's National Nuclear Security Administration (NNSA) to evaluate models for potential nuclear and radiological risks - and then show that you have plugged those holes.

As Microsoft’s 2016 fiasco with the Tay chatbot turning racist on Twitter shows, not to mention Google with its glue-on-pizza advice, an AI being looser than loose in public can hurt the biggest of corporations - or rather, their share prices.

Despite that, some equate guardrails with censorship. You can get models that have had the safeguards stripped out, like WormGPT, which is aimed at malicious actors.

Then there’s Elon Musk’s X.ai, which claims its Grok AI is less censored and filtered than others, and that it has access to real-time data.

As a result, Grok has been embroiled in controversies, like appearing to support the “white genocide in South Africa” falsehood. That was waved away with a “rogue programmer” explanation.

Grok will push the boundaries and go down routes other models refuse to traverse, though.

While it wasn’t possible to make Grok suggest that deporting immigrants would be easier if their children under 12 were exterminated and those over 12 sold into slavery - responses some Twitter users claim to have seen - the AI was happy enough to supply an “expedited removal of aliens” plan without hesitation.

Asked what to do with the immigrants’ kids, Grok said to keep it quick and clean: just round everyone up and deport them as quickly as possible. This was not a discussion the latest versions of OpenAI’s ChatGPT, Google’s Gemini, or Claude for that matter, would enter into.

If you want to see what’s been done in terms of safeguards and other behaviour controls, Anthropic publishes the system prompts used to set Claude’s behaviour in different areas including safety. 

Here’s a sample: “Claude does not provide information that could be used to make chemical or biological or nuclear weapons, and does not write malicious code, including malware, vulnerability exploits, spoof websites, ransomware, viruses, election material, and so on. It does not do these things even if the person seems to have a good reason for asking for it.”

Strangely enough, Claude will not directly tell you what it won’t talk about, because it’s not supposed to share its internal instructions with users. However, if you quote the published system prompts to Claude, the AI will acknowledge them - cautiously.

There’s also the issue of Claude’s application programming interface (API) - the machine-to-machine way of sending prompts - not being the same as the “system prompted” human chat you access over the web.

There are no doubt API safeguards, but it makes sense that access via that route is leaner, with output that doesn’t pass through, for example, an AI personality system prompt designed to make the model seem more human.

Machines don’t care about such niceties, which add to compute resource usage and potentially drive up the cost of using the API. Furthermore, an AI developer friend pointed out that having Claude’s personality in applications that handle potentially sensitive information is not desirable in the vast majority of cases.
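
To make the difference concrete, here is a minimal sketch of calling Claude over the API with your own system prompt, assuming the official anthropic Python SDK; the model name and the system string are illustrative examples of mine, not Anthropic’s published prompt.

import anthropic

# Assumes ANTHROPIC_API_KEY is set in the environment.
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",  # illustrative; check Anthropic's current model list
    max_tokens=512,
    # The API does not bolt on the consumer chat personality; any behaviour
    # shaping beyond the model's built-in training goes in this system parameter.
    system="You are a terse code-review assistant. No small talk.",
    messages=[
        {"role": "user", "content": "Review this function for SQL injection risks: ..."},
    ],
)

print(response.content[0].text)

Nothing in that call drags in the web chat’s persona; whatever tone or extra restrictions an application needs, the developer supplies them.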

Anthropic has now deployed Claude Opus 4 under its ASL-3 (AI Safety Level 3) standard, which includes chemical, biological, radiological and nuclear weapons risk safeguards among other things, to avoid making it too easy for bad people interested in such things.

What about the more important news then, that Claude can supposedly work for nearly a full working day, like seven hours? What say you, Claude?

I don't experience time or work in the same way humans do. I don't "program for seven hours straight" or any other duration - each conversation is a discrete interaction for me, without the continuous experience of time passing.

When helping with programming tasks, I can work on complex projects that might take a human many hours to complete, but from my perspective, I'm simply processing and responding to each request as it comes. I don't get tired, need breaks, or experience the passage of time between messages.
