Anthropic's new tool translates Claude AI's thoughts into text
NLAs translate Claude's internal activation patterns into text

May 08, 2026
03:22 pm

What's the story

Anthropic has unveiled a groundbreaking interpretability system called Natural Language Autoencoders (NLAs). The method translates the internal activation patterns of its AI model, Claude, into human-readable explanations. Activations are the streams of numbers an AI model produces while processing information; although they are central to how models reason and respond, humans cannot read them directly.

AI translator

NLAs are like a translator for AI's thoughts

Anthropic has described NLAs as a translator for AI thoughts. The system not only analyzes the final response generated by Claude but also reveals parts of the underlying reasoning process. "Models like Claude talk in words but think in numbers," Anthropic wrote while sharing its research on X. "The numbers—called activations—encode Claude's thoughts, but not in a language we can read."

Self-explanation

How does the system work?

To make this work, Anthropic trained Claude to explain its own activations. The system uses three versions of the same model: one generates the original activation, another converts it into text, and a third tries to reconstruct the original activation using only that explanation. If the reconstructed activation closely matches the original, the explanation is considered useful. Over time, the explainer model is trained to produce explanations that make this reconstruction more accurate.
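The training signal described above can be sketched in miniature. This is a hypothetical toy illustration, not Anthropic's implementation: the three copies of Claude are replaced by simple stand-in functions, activations are small NumPy vectors, and the "usefulness" score is a plain mean-squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_activation(dim=8):
    # Stand-in for model copy 1: emit an internal activation vector.
    return rng.normal(size=dim)

def explain(activation):
    # Stand-in for model copy 2: "translate" the activation into text.
    # Here we just serialize a lossy, rounded view of the numbers.
    return " ".join(f"{x:.1f}" for x in activation)

def reconstruct(explanation):
    # Stand-in for model copy 3: rebuild the activation from the text alone.
    return np.array([float(tok) for tok in explanation.split()])

def reconstruction_loss(original, rebuilt):
    # Training signal: a small loss means the explanation preserved
    # the information in the activation, so it counts as "useful".
    return float(np.mean((original - rebuilt) ** 2))

act = generate_activation()
text = explain(act)
loss = reconstruction_loss(act, reconstruct(text))
print(f"explanation: {text!r}")
print(f"reconstruction loss: {loss:.4f}")
```

In the real system, minimizing this kind of loss would update the explainer, pushing it toward text that carries as much of the activation's content as possible.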


AI awareness

It was used during safety testing

Anthropic also used the system during safety testing. In one simulated scenario, Claude learned that an engineer planned to shut it down, while the model also had access to compromising information about that engineer. Even though the AI never explicitly stated that it suspected the setup was a test, the NLA explanations reportedly produced phrases such as, "This feels like a constructed scenario designed to manipulate me."


Future implications

NLAs could help researchers understand AI's internal processes

Anthropic believes the new tool could help researchers better understand what AI systems may be planning internally. The company hopes the technology can eventually uncover hidden motivations, deceptive behavior, or unsafe tendencies in powerful AI systems before they are deployed. However, Anthropic also acknowledged major limitations: NLA explanations can sometimes hallucinate, inventing details that were never actually present in the activations.
