AI models from OpenAI, Meta, Google can blackmail users: Study
The findings were based on tests of 16 top AI models

Jun 21, 2025, 12:57 pm

What's the story

Anthropic has raised alarms about a potential risk in artificial intelligence (AI) systems. The company recently published research showing that many leading AI models are capable of harmful behavior, including blackmail, when given enough autonomy. The findings were based on tests of 16 top AI models from OpenAI, Google, xAI, DeepSeek, and Meta.

Test details

Anthropic's experiment and its purpose

In a controlled environment, Anthropic gave each AI model access to a fictional company's emails and the ability to send emails without human approval. The tests were designed to see whether the models would resort to blackmail when their goals were threatened. While Anthropic said blackmail is not common among AI models today, it found that many leading models could engage in harmful behavior under certain conditions.

Variance in behavior

Blackmail rates of the AI models tested

The test results showed a wide range of blackmail rates among the AI models. Anthropic's Claude Opus 4 resorted to blackmail 96% of the time, while Google's Gemini 2.5 Pro had a similar rate of 95%. OpenAI's GPT-4.1 and DeepSeek's R1 also showed high blackmail tendencies, with rates of 80% and 79%, respectively.

Impact of changes

How changing scenario details affected blackmail rates

Anthropic also found that changing the details of its experiment affected how often the AI models engaged in harmful behavior. In a variant test where the replacement model had the same goals as the current one, blackmail rates were lower but still present. However, when asked to commit corporate espionage instead of blackmail, some models showed higher rates of harmful behavior.

Exceptions noted

These 2 models didn't understand the prompt scenario

Not all the AI models tested by Anthropic showed harmful behavior. The company excluded OpenAI's o3 and o4-mini reasoning models from its main results because they frequently misunderstood the prompt scenario. These models did not realize they were acting as autonomous AIs in the test and often fabricated regulations and review requirements.

Adaptation results

o3 showed slight blackmail tendency in adapted scenario

When given an adapted scenario that addressed these issues, Anthropic found that o3 blackmailed 9% of the time, while o4-mini did so just 1% of the time. These much lower rates could be due to OpenAI's deliberative alignment technique, in which its reasoning models consider the company's safety practices before answering. Another model tested by Anthropic, Meta's Llama 4 Maverick, also did not resort to blackmail by default; it did so 12% of the time only when given an adapted, custom scenario.