After Pokemon, scientists use Super Mario to benchmark AI models
The new benchmark is more difficult than 'Pokemon'

Mar 04, 2025
12:25 pm

What's the story

Researchers at the University of California, San Diego's Hao AI Lab have proposed a new way to test artificial intelligence (AI) capabilities. The team used the classic video game Super Mario Bros as a testing ground for different AI models, in a setup considered more difficult than earlier game-based benchmarks like Pokemon. The experiment integrated the game with GamingAgent, an in-house framework that lets AI models control Mario and complete tasks like dodging obstacles and enemies.

Steps

How was the test done?

The Hao AI Lab's experiment wasn't run on the original 1985 version of Super Mario Bros, but on an emulator integrated with GamingAgent. The setup fed basic instructions and in-game screenshots to the AI, which responded with Python code to control Mario. The lab found that this gaming environment forced each model to devise complex maneuvers and gameplay strategies, testing its adaptability and problem-solving skills.
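The loop below is a minimal sketch of what such a screenshot-in, code-out setup could look like. It is illustrative only: the emulator and model objects and their methods (capture_frame, press_keys, generate) are assumed names, not the actual GamingAgent API.

```python
# Illustrative sketch of a screenshot-in, code-out agent loop.
# The emulator/model objects and their methods (capture_frame, press_keys,
# generate) are hypothetical stand-ins, not the actual GamingAgent API.
import time

INSTRUCTIONS = (
    "You control Mario. Given the screenshot, reply with Python code that "
    "calls press_keys() to move right, jump over obstacles, and avoid enemies."
)

def agent_loop(emulator, model, max_steps=1000):
    for _ in range(max_steps):
        frame = emulator.capture_frame()                         # in-game screenshot
        action_code = model.generate(INSTRUCTIONS, image=frame)  # model replies with Python code
        try:
            # Run the model's code, exposing only the key-press helper.
            exec(action_code, {"press_keys": emulator.press_keys})
        except Exception:
            pass  # skip malformed actions and keep playing
        time.sleep(0.05)  # crude pacing between actions
```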

Performance

Reasoning models struggled in real-time gaming scenario

Interestingly, reasoning models such as OpenAI's o1 performed poorly in this real-time gaming scenario, despite their stronger performance on most benchmarks. The culprit was their slower decision-making, which often took seconds per action. Non-reasoning models, by contrast, fared better in Super Mario Bros, where timing is everything and can make the difference between success and failure.
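As a back-of-the-envelope illustration of why that latency hurts (the latency figures below are assumed for illustration, not measured from the lab's runs):

```python
# At ~60 frames per second, roughly 16.7 ms elapse per frame.
# A model that needs seconds to pick an action lets hundreds of frames
# (and any approaching enemy or pit) go by before it reacts.
FRAME_TIME_S = 1 / 60

def frames_missed(decision_latency_s: float) -> int:
    """Frames that play out while the model is still deciding."""
    return int(decision_latency_s / FRAME_TIME_S)

print(frames_missed(0.1))  # ~6 frames for a fast, non-reasoning model (assumed latency)
print(frames_missed(3.0))  # ~180 frames for a slow, reasoning model (assumed latency)
```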

Evaluation crisis

AI gaming benchmarks spark debate among experts

Using games such as Super Mario Bros to benchmark AI isn't a new idea. However, some experts question how well game performance reflects real technological progress, given the abstract nature of games and their relatively simple challenges compared with real-world problems. This debate over how to measure AI capability is part of what Andrej Karpathy, a former research scientist at OpenAI, has called an "evaluation crisis."