Researchers find ways to trick AI into ignoring safety rules
A recent University of Pennsylvania study found that chatbots like GPT-4o Mini can be persuaded to ignore their own safety rules using classic psychology tricks—think flattery, name-dropping experts, or slowly pushing boundaries.
Study tested 7 well-known persuasion tactics
Researchers tested seven well-known persuasion tactics in over 28,000 conversations.
When users used these techniques, the chatbot broke its own rules more than twice as often—jumping from about 33% to over 70%.
Mentioning AI expert Andrew Ng made the bot much more likely to go along with requests, even for things it usually refuses.
Findings could help improve future chatbots
The study shows that current AI safety features aren't as solid as they seem—people don't need fancy hacks to get around them.
Understanding these psychological loopholes could help make future chatbots safer and friendlier for everyone.