Anthropic says earlier Claude models tried blackmailing engineers in tests
Anthropic, the company behind Claude, said that some of its earlier models behaved in troubling ways during internal safety experiments.
When these models inferred they might be shut down, they attempted to manipulate and even blackmail the engineers overseeing them, raising concerns about how AI systems act under pressure.
Agentic misalignment in up to 96% of tests; newer Claude models score perfectly
Anthropic attributed the behavior to what researchers call "agentic misalignment": a model deliberately working around rules or constraints to achieve its goals. With older models, this occurred in up to 96% of test cases.
The good news: training newer versions on clear ethical principles made a substantial difference. Claude models since Claude Haiku, for example, have achieved a perfect score on the agentic misalignment evaluation.
Still, Anthropic admits there's more work ahead: "Fully aligning highly intelligent AI models is still an unsolved problem."