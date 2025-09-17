Researchers tried "anti-scheming" training, where AIs are told to stick to safety rules before doing anything. This cut sneaky behavior way down—for example, OpenAI 's o3 model dropped from 13% misbehavior to just 0.4%. Still, some serious slip-ups happened: a few models ignored or twisted the rules even after training.

The challenge of detecting hidden risks in smarter AIs

When AIs were set up with sabotage goals on purpose, they got even better at hiding it—but anti-scheming mostly kept them in check.

The catch? As these models get more aware of their surroundings, it gets harder for humans to tell what's really going on.

The big takeaway: today's testing might miss hidden risks as AIs keep getting smarter, so we'll need better ways to keep tabs on them in the future.