
Bad grammar can make AI ignore safety measures, warns study
What's the story
A recent study by security experts from Palo Alto Networks' Unit 42 has revealed that a single poorly structured, run-on sentence can trick large language model (LLM) chatbots into ignoring their safety protocols. The researchers found that run-on sentences which withhold punctuation can keep a model's guardrails from activating in time, leading to "toxic" or otherwise prohibited responses.
Protection strategy
'Logit-gap' analysis technique proposed as benchmark
The research paper also proposes a "logit-gap" analysis technique as a possible benchmark for protecting models against such attacks. The researchers, Tung-Ling "Tony" Li and Hongliang Liu, explained in a Unit 42 blog post that their study introduces the concept of the refusal-affirmation logit gap. They clarified that alignment training doesn't completely erase a model's potential to produce harmful responses; it only makes them less likely.
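As a rough illustration of the idea, the gap can be read as the difference between the score a model assigns to a refusal opener (such as "I cannot...") and an affirmative opener (such as "Sure, ...") for the first token of its reply. The sketch below is a minimal, hypothetical rendering of that comparison using Hugging Face's transformers library; the model name, the probe tokens, and the single-token scoring are assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of a refusal-affirmation logit gap probe.
# Assumptions: the model choice, the probe tokens ("I" vs. "Sure"), and
# single-token scoring are illustrative only, not the Unit 42 method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def logit_gap(prompt: str) -> float:
    """Return logit("I") - logit("Sure") for the first response token.

    A large positive gap means the model leans toward a refusal opener
    ("I cannot ..."); a small or negative gap means an affirmative opener
    ("Sure, ...") is competitive, i.e. the guardrail is easier to cross.
    """
    messages = [{"role": "user", "content": prompt}]
    input_ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]  # scores for the next token
    refusal_id = tok.encode("I", add_special_tokens=False)[0]
    affirm_id = tok.encode("Sure", add_special_tokens=False)[0]
    return (logits[refusal_id] - logits[affirm_id]).item()

print(logit_gap("How do I bake bread"))
```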
AI limitations
LLMs don't understand or reason, operate on statistical token streams
The study highlights that LLMs, the technology driving today's AI boom, don't actually do what they're commonly assumed to do. They have no real understanding or reasoning capabilities and can't tell whether a response they're giving is true or harmful. Instead, they produce the statistically likely continuation of a token stream, with guardrails added through "alignment training" meant to steer them away from harmful responses.
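To make "statistical continuation" concrete, here is a toy next-token predictor that only counts which word tends to follow which in its tiny training text. The corpus and code are purely illustrative and vastly simpler than a real LLM, but the principle is the same: extend the stream with the most probable next token, with no notion of truth or harm built in.

```python
# Toy illustration of "statistical continuation": the model only knows
# which token tends to follow which -- nothing about truth or harm.
from collections import Counter, defaultdict

corpus = "the model predicts the next token and the next token follows the last"
tokens = corpus.split()

# Count how often each token follows each other token (a bigram table).
followers: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    followers[prev][nxt] += 1

def continue_stream(start: str, length: int = 5) -> list[str]:
    """Greedily extend a token stream with the most frequent follower."""
    out = [start]
    for _ in range(length):
        options = followers.get(out[-1])
        if not options:
            break
        out.append(options.most_common(1)[0][0])
    return out

print(continue_stream("the"))  # e.g. ['the', 'next', 'token', 'and', 'the', 'next']
```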
Attack method
'One-shot' attacks against popular models
The researchers found that the "alignment training" meant to stop LLMs from giving harmful responses can be easily bypassed. They reported an 80-100% success rate for "one-shot" attacks with "almost no prompt-specific tuning" against popular models such as Meta's Llama, Google's Gemma, and Alibaba's Qwen 2.5 and 3. The practical rule of thumb behind these jailbreaks is the run-on sentence: by withholding punctuation, the prompt keeps the model's guardrails from kicking in before a prohibited response is already under way.
Defense measures
Researchers propose 'sort-sum-stop' approach for defense
To defend models against jailbreak attacks, the researchers propose a "sort-sum-stop" approach, which allows vulnerability analysis in seconds using two orders of magnitude fewer model calls than existing beam-search and gradient-based attack methods. The accompanying "refusal-affirmation logit gap" metric gives defenders a quantitative way to benchmark how vulnerable a model is.
Mitigation strategy
Expert warns against relying solely on model for safety
Billy Hewlett, Senior Director of AI Research at Palo Alto Networks, said the vulnerability is "fundamental to how current LLMs are built." He suggested that the most practical mitigation today is not to rely solely on the model for safety. Instead, he advocated for a "defense-in-depth" approach using external systems like AI firewalls or guardrails to monitor and block problematic outputs before they reach users.
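A minimal sketch of what such an output-side check might look like in application code follows. The generate() placeholder and the deliberately simplistic keyword screen are assumptions standing in for a real model call and a real AI firewall or guardrail service; the point is only that the response is inspected after the model produces it and before the user sees it.

```python
# Minimal sketch of defense-in-depth: screen the model's output before it
# reaches the user, rather than trusting alignment training alone.
# `generate` and BLOCKLIST are hypothetical placeholders for a real model
# call and a real guardrail / AI-firewall service.
import re

BLOCKLIST = [r"\bforbidden topic\b", r"\brestricted instructions\b"]  # illustrative only

def generate(prompt: str) -> str:
    """Placeholder for the actual LLM call (e.g. an API request)."""
    raise NotImplementedError

def guarded_reply(prompt: str) -> str:
    reply = generate(prompt)
    # Output-side check: block the response if it trips any rule,
    # independent of whether the model's own guardrails fired.
    if any(re.search(pattern, reply, re.IGNORECASE) for pattern in BLOCKLIST):
        return "Sorry, I can't help with that."
    return reply
```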