With dozens of AI models available, it is hard to tell which ones are the strongest. A group of researchers suggests that Super Mario can be used to compare them.
AI models are usually compared using benchmark tests, but a new approach has recently drawn attention: having them play Super Mario Bros. Hao AI Lab, a research group at the University of California San Diego, put popular AI models into the game and compared how well each one played.
In the experiment, Anthropic’s Claude 3.7 model showed the best performance, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o models showed lower performance than expected.
Thinking models fall victim to too much “thinking”
Claude-3.7 was tested on Pokémon Red, but what about more real-time games like Super Mario 🍄🌟?
We threw AI gaming agents into LIVE Super Mario games and found Claude-3.7 outperformed other models with simple heuristics. 🤯
Claude-3.5 is also strong, but less capable of… pic.twitter.com/bqZVblwqX3
— Hao AI Lab (@haoailab) February 28, 2025
AI Struggles with Super Mario Bros.: The Challenge of Real-Time Decision Making
A recent experiment tested AI models’ ability to play Super Mario Bros., though not by running the original 1985 release directly. Instead, the game ran in an emulator integrated with a framework called GamingAgent, which gave the AI systems control over Mario. The framework supplied each model with basic instructions, such as “jump to avoid obstacles or enemies,” along with screenshots of the game. The models then generated Python code to control Mario’s movements.
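The loop described above (instruction plus screenshot in, Python code out, code executed against the game) can be sketched roughly as follows. This is a minimal illustration, not GamingAgent's actual code: `FakeController`, `stub_model`, `agent_step`, and `run_episode` are hypothetical names, the "screenshots" are plain byte strings, and the model is a stub that returns canned Python source instead of calling a real AI service.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Instruction text quoted from the article; everything else here is assumed.
INSTRUCTION = "jump to avoid obstacles or enemies"


@dataclass
class FakeController:
    """Hypothetical stand-in for an emulator's input interface."""
    log: List[str] = field(default_factory=list)

    def press(self, button: str) -> None:
        # Record the button instead of sending it to a real emulator.
        self.log.append(button)


def stub_model(instruction: str, screenshot: bytes) -> str:
    """Stand-in for a model call: returns Python source code as text."""
    # A real agent would send the instruction and the image to a model here.
    if b"enemy" in screenshot:
        return "controller.press('A')  # jump"
    return "controller.press('RIGHT')  # keep running"


def agent_step(model: Callable[[str, bytes], str],
               controller: FakeController,
               screenshot: bytes) -> str:
    """Ask the model for code, then run that code against the controller."""
    code = model(INSTRUCTION, screenshot)
    # Running model-written code is unsafe outside a sandbox; illustration only.
    exec(code, {"controller": controller})
    return code


def run_episode(frames: List[bytes]) -> List[str]:
    """Feed a sequence of fake screenshots through the agent loop."""
    controller = FakeController()
    for frame in frames:
        agent_step(stub_model, controller, frame)
    return controller.log


if __name__ == "__main__":
    print(run_episode([b"clear", b"enemy ahead", b"clear"]))
```

In this sketch, every decision costs one full round trip to the model, which is why response time matters so much in a game that does not pause while the model responds.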
According to researchers at Hao AI Lab, the experiment was designed to evaluate how AI systems handle complex maneuvers and strategic gameplay. Surprisingly, reasoning models that work through each move step by step performed worse than models that respond directly without an extended reasoning phase. Notably, OpenAI’s o1 model, which typically excels in benchmark tests, failed in this scenario.
The primary reason? Speed matters in real-time gaming. Reasoning models like o1 need time to think before committing to a move, and the game does not wait for them. In Super Mario Bros., even a one-second delay can mean the difference between clearing a jump and instant failure.
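To put the delay in perspective, the arithmetic can be sketched as below. The numbers are assumptions for illustration, not measurements from the experiment: the frame rate is rounded to 60 frames per second (roughly what an NTSC NES produces), and the latencies and the 30-frame hazard distance are hypothetical.

```python
FPS = 60  # assumed frame rate, rounded; roughly that of an NTSC NES


def frames_elapsed(latency_ms: int, fps: int = FPS) -> int:
    """Whole frames the game advances while the model is still deciding."""
    return latency_ms * fps // 1000


def reacts_in_time(latency_ms: int, frames_until_hazard: int,
                   fps: int = FPS) -> bool:
    """True if the model's action arrives before Mario reaches the hazard."""
    return frames_elapsed(latency_ms, fps) < frames_until_hazard


if __name__ == "__main__":
    # Hypothetical latencies: a fast model, a one-second delay, a slow reasoner.
    for latency_ms in (200, 1000, 5000):
        frames = frames_elapsed(latency_ms)
        ok = reacts_in_time(latency_ms, frames_until_hazard=30)
        print(f"{latency_ms} ms -> {frames} frames, in time: {ok}")
```

Under these assumptions, a one-second delay lets 60 frames go by, so a hazard half a second away has already been reached by the time the move is issued.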
AI has been tested in games for decades, often as a benchmark for progress in artificial intelligence. However, some experts question whether gaming skill truly reflects an AI system’s general intelligence or technological advancement. Unlike the real world, games are structured environments with predefined rules and a theoretically unlimited supply of training data, which makes them a convenient but limited testing ground for AI capabilities.