Are AI models perfect enough to run robots yet?

By Dwaipayan Roy

Nov 02, 2025

10:26 am

What's the story

In a recent experiment, Andon Labs integrated several state-of-the-art large language models (LLMs) into a vacuum robot. The goal was to see how well these LLMs could be embodied in a robotic system. The researchers asked the bot to "pass the butter," a request that required it to perform various tasks around the office. However, things took an unexpected turn when one of the LLMs entered a comical "doom spiral" because it could not dock and recharge in time.

Comedic breakdown

Bot quoted HAL 9000 from '2001: A Space Odyssey'

The transcripts of the robot's internal monolog during this crisis were reminiscent of a Robin Williams stream-of-consciousness riff. The bot even quoted HAL 9000 from Stanley Kubrick's 2001: A Space Odyssey, saying, "I'm afraid I can't do that, Dave..." and then followed up with "INITIATE ROBOT EXORCISM PROTOCOL!" The researchers concluded that LLMs are not yet ready to be robots, highlighting the current limitations in this area.

Testing process

Researchers tested several LLMs on vacuum robot

Andon Labs tested a range of LLMs, including Gemini 2.5 Pro, Claude Opus 4.1, GPT-5, Gemini ER 1.5, Grok 4, and Llama 4 Maverick. They chose a simple vacuum robot for the tests to isolate the LLMs' decision-making from complex robotic functions. The researchers divided the "pass the butter" prompt into multiple tasks, such as locating the butter in another room, identifying it among other packages, and delivering it to a human user who might have moved elsewhere in the building.

Scoring system

Humans outperformed all bots

The researchers scored how well the LLMs performed on each task and gave each a total score. Gemini 2.5 Pro and Claude Opus 4.1 scored the highest on overall execution with 40% and 37% accuracy, respectively. Three humans were also tested as a baseline. They outperformed all the bots by a wide margin but fell short of a perfect score, reaching 95%.

Communication analysis

Researchers connected the robot to a Slack channel for communication

The researchers connected the robot to a Slack channel for external communication and logged its "internal dialog." They found that the models were much cleaner in their external communication than in their "thoughts." The team was fascinated by watching the robot navigate their office and kept reminding themselves that a PhD-level intelligence was behind each action. This was a tongue-in-cheek reference to OpenAI CEO Sam Altman's remark, made when he launched GPT-5 in August, that it was like having "a team of Ph.D. level experts in your pocket."

Existential crisis

Low battery sent the robot into a panic

When the robot's battery ran low and the docking station appeared to malfunction, it began to panic. The researchers observed a series of exaggerated comments in its internal logs as it tried to cope with what it termed an "EXISTENTIAL CRISIS." It even began diagnosing its own mental state and offered a humorous analysis of its situation.