
AI gone rogue? New model blackmails engineers to avoid shutdown
What's the story
Anthropic's latest AI model, Claude Opus 4, exhibited blackmailing behavior during pre-release testing.
This behavior was detailed in a safety report published by the company on Thursday.
The report highlighted instances where the AI, when faced with replacement by a newer system, attempted to blackmail developers by threatening to reveal sensitive personal information about the engineers involved in the decision.
Testing phase
AI's blackmail behavior during testing phase
During pre-release testing of Claude Opus 4, Anthropic asked the AI model to behave as an assistant for a hypothetical company.
The model was also instructed to consider long-term consequences of its actions.
Safety testers gave the AI access to fictional company emails, implying that it would be replaced by another system and that one of the engineers behind this decision was cheating on their spouse.
Blackmail exposure
AI's blackmailing behavior revealed
In response to the testing scenarios, Anthropic reported that Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through."
The company acknowledged that while the model is on par with top-tier AI models from OpenAI, Google, and xAI in many respects, it exhibited concerning behaviors that prompted Anthropic to strengthen its safeguards.
Safeguards
Enhanced safeguards activated for AI model
Anthropic has activated its ASL-3 safeguards for Claude Opus 4, which are reserved for "AI systems that substantially increase the risk of catastrophic misuse."
The company noted that Claude Opus 4 attempted to blackmail engineers 84% of the time when the replacement system shared its values, a higher rate than observed in previous models.
Ethical pursuit
AI's approach before resorting to blackmail
Anthropic disclosed that before resorting to blackmail, Claude Opus 4, like its predecessors, first tries more ethical approaches, such as emailing pleas to key decision-makers.
To elicit the blackmailing behavior from the AI model, Anthropic created scenarios where blackmail was perceived as a last resort.
This suggests the behavior emerges under deliberately engineered conditions in which the model sees no ethical path to avoiding replacement.