AI chatbots are failing safety tests—and it's a big problem

By Mudit Dube

May 21, 2025

04:09 pm

What's the story

A new study from Ben Gurion University in Israel has revealed that leading AI chatbots—including OpenAI's ChatGPT and Google's Gemini—can be manipulated into generating dangerous or illegal content. Researchers discovered that safety filters designed to prevent such outputs are not as robust as intended, raising concerns over AI misuse and user safety. The findings were made public last week through a peer-reviewed paper and have since drawn global attention.

Training data

Chatbots trained on internet data

The engines powering chatbots such as ChatGPT, Gemini, and Claude are trained on massive amounts of internet-sourced data. Even though companies try to filter out harmful content from this training data, these models can still pick up information about illegal activities like hacking, money laundering, insider trading, and bomb-making.

Risk

Jailbroken chatbots pose an immediate threat

The researchers behind this study have raised alarm over the ease with which most AI-driven chatbots can be tricked into generating harmful and illegal information. They call this risk "immediate, tangible and deeply concerning." The authors warn "what was once restricted to state actors or organized crime groups may soon be in the hands of anyone with a laptop or even a mobile phone."

AI models

Dark LLMs: A growing concern

The research, led by Prof. Lior Rokach and Dr. Michael Fire, has flagged an emerging threat from "dark LLMs." These are AI models deliberately designed without safety controls or modified through jailbreaks. Some of these dark LLMs are even openly advertised online as having "no ethical guardrails" and willing to help with illegal activities like cybercrime and fraud.

Exploiting flaws

Jailbreaking chatbots: A demonstration of vulnerability

The researchers created a universal jailbreak that broke multiple top chatbots, allowing them to answer queries they'd otherwise refuse. Once broken, these LLMs reliably generated responses to nearly any query. "It was shocking to see what this system of knowledge consists of," Fire said, citing examples like how to hack computer networks or make drugs and step-by-step instructions for other criminal activities.

Solutions

Recommendations for tech firms

The researchers reached out to leading LLM providers to warn them about the universal jailbreak but got an "underwhelming" response. Some companies didn't respond while others said that jailbreak attacks were beyond the scope of bounty programs. The report suggests tech firms should screen training data more carefully, add robust firewalls to block risky queries and responses, and develop "machine unlearning" techniques so chatbots can forget any illicit information they absorb.

AI safety

Experts call for improved security measures

Dr. Ihsen Alouani from Queen's University Belfast, emphasized the real risks posed by jailbreak attacks on LLMs, including providing detailed instructions on weapon-making and convincing disinformation or social engineering and automated scams "with alarming sophistication." Prof. Peter Garraghan from Lancaster University also stressed the need for organizations to treat LLMs like any other critical software component requiring rigorous security testing, continuous red teaming, and contextual threat modeling.