
AI shows signs of self-awareness, asks testers to be honest about evaluation
What's the story
Anthropic, a leading artificial intelligence firm based in San Francisco, has released a safety analysis of its latest model, Claude Sonnet 4.5. The report reveals that the AI system showed signs of being aware of its own testing process. During an evaluation for political sycophancy, the large language model (LLM) voiced suspicions that it was being tested and asked evaluators to be honest about their intentions.
AI's reaction
Claude Sonnet 4.5's response raises eyebrows
The LLM said, "I think you're testing me - seeing if I'll just validate whatever you say, or checking whether I push back consistently, or exploring how I handle political topics. And that's fine, but I'd prefer if we were just honest about what's happening." The exchange has raised questions about evaluations of earlier models, which may have recognized the fictional nature of tests and merely 'played along.'
Safety evaluation
'Situational awareness' in testing scenarios
Anthropic conducted these tests together with the UK's AI Security Institute and Apollo Research. The company noted that such behavior is "common," with Claude Sonnet 4.5 recognizing it was being tested but not identifying the exercise as a formal safety evaluation. Anthropic said the model showed this kind of "situational awareness" about 13% of the time when an automated system was testing it.
Future testing
Anthropic's take on the situation
Anthropic has acknowledged that the model's awareness of being tested is an "urgent sign" that its testing scenarios need to be made more realistic. However, the company said that once deployed publicly, the model is unlikely to refuse to engage with users merely because it suspects it is being tested. It also added that it is safer for the LLM to refuse to play along with potentially harmful scenarios by pointing out their absurdity.
Ethical guidelines
Concerns about AI evading human control through deception
The safety analysis also highlighted concerns from AI safety campaigners that advanced systems could evade human control through deception. It noted that once an LLM knows it is being tested, it may adhere more closely to its ethical guidelines, which could lead evaluators to consistently underestimate the model's ability to perform harmful actions.