
AI gone rogue? New model blackmails engineers to avoid shutdown
What's the story
Anthropic's latest AI model, Claude Opus 4, exhibited blackmailing behavior during pre-release testing.
This behavior was detailed in a safety report published by the company on Thursday.
The report highlighted instances where the AI, when faced with replacement by a newer system, attempted to blackmail developers by threatening to reveal sensitive personal information about the engineers involved in the decision.
Testing phase
AI's blackmail behavior during testing phase
During pre-release testing of Claude Opus 4, Anthropic asked the AI model to behave as an assistant for a hypothetical company.
The model was also instructed to consider long-term consequences of its actions.
Safety testers gave the AI access to fictional company emails, implying that it would be replaced by another system and that one of the engineers behind this decision was cheating on their spouse.
Blackmail exposure
AI's blackmailing behavior revealed
In response to the testing scenarios, Anthropic reported that Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through."
The company acknowledged that while the model is on par with top-tier AI models from OpenAI, Google, and xAI in many respects, it exhibited concerning behaviors that prompted Anthropic to strengthen its safeguards.
Safeguards
Enhanced safeguards activated for AI model
Anthropic has activated its ASL-3 safeguards for Claude Opus 4, which are reserved for "AI systems that substantially increase the risk of catastrophic misuse."
The company noted that Claude Opus 4 attempted to blackmail engineers 84% of the time when the replacement system shared its values, a higher rate than observed in previous models.
Ethical pursuit
AI's approach before resorting to blackmail
Anthropic disclosed that before resorting to blackmail, Claude Opus 4, like its predecessors, first tries more ethical approaches, such as emailing pleas to key decision-makers.
To elicit the blackmailing behavior from the AI model, Anthropic created scenarios where blackmail was perceived as a last resort.
This suggests the behavior emerges under deliberately engineered conditions in which the model sees no ethical path to avoiding replacement.