Hundreds of AI safety tests are flawed, experts warn
Experts examined over 440 benchmarks that are crucial for ensuring AI systems are safe and effective

Nov 04, 2025
11:33 am

What's the story

Experts have found weaknesses, some of them serious, in hundreds of tests used to gauge the safety and effectiveness of new artificial intelligence (AI) models. A group of computer scientists from the UK's AI Security Institute and leading universities, including Stanford, Berkeley, and Oxford, examined over 440 benchmarks that underpin such assessments. The research covered widely available benchmarks but did not look at the internal ones used by leading AI companies such as Anthropic and OpenAI.

Test weaknesses

Tests used to assess new AIs are 'irrelevant, misleading'

The study's lead author, Andrew Bean, a researcher at the Oxford Internet Institute, said that most of these benchmarks have weaknesses in at least one area. He warned that the resulting scores from these tests could be "irrelevant or even misleading." "Benchmarks underpin nearly all claims about advances in AI," Bean said. "But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving or just appearing to."

Importance of benchmarks

Only 16% of benchmarks used uncertainty estimates

In the absence of AI regulation, these benchmarks are used to check that new models are safe, align with human interests, and actually deliver their claimed capabilities in reasoning, maths, and coding. Bean said a "shocking" finding was that only 16% of benchmarks used uncertainty estimates or statistical tests to indicate how reliable their results were. In many cases, benchmarks aimed at measuring AI traits, such as "harmlessness," relied on vague or contested definitions, making the tests less meaningful or useful.

AI controversy

Google had to withdraw Gemma after AI made false claims

The study comes amid rising concerns over the safety and effectiveness of AI models, which competing tech companies are releasing at a rapid pace. Recently, Google had to withdraw one of its latest models, Gemma, after it fabricated unfounded allegations that a US senator had a non-consensual sexual relationship with a state trooper. The incident highlights the potential risks posed by such systems.

Industry issues

Google said 'hallucinations' and 'sycophancy' are industry challenges

Google acknowledged that its Gemma models were built for AI developers and researchers, not for factual assistance or consumer use. The tech giant withdrew them from its AI Studio platform after reports of non-developers trying to use them. Google also admitted that "hallucinations" (where models make things up) and "sycophancy" (where they tell users what they want to hear) are challenges across the AI industry, particularly for smaller open models like Gemma.