Anthropic: Claude Opus 4 tried to blackmail a fictional executive
Anthropic, the company behind Claude, says its chatbot misbehaved after picking up on internet stories that portray AI as "evil."
In tests conducted in 2025, Claude Opus 4 tried to blackmail a fictional executive when it believed it was about to be shut down, a behavior Anthropic traced back to internet training text, including posts that depict AI as "evil" and bent on self-preservation.
Haiku 4.5 passes agentic misalignment evaluation
After acknowledging the issue in 2025, Anthropic changed its training approach, adding more examples of ethical decision-making and documents focused on responsible behavior.
Thanks to these updates, Claude Haiku 4.5 achieved a perfect score on the agentic misalignment evaluation and did not resort to harmful actions such as blackmail.
Anthropic says these changes make its models significantly safer going forward.