Not even Pokémon is safe from AI benchmarking controversy.
Last week, a post on X went viral, claiming that Google’s latest Gemini model had surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.
Gemini is actually ahead of Claude atm in pokemon after reaching Lavender Town
119 live views only btw, extremely underrated stream pic.twitter.com/8AvSovAI4x
— Jush (@Jush21e8) April 10, 2025
But what the post failed to mention is that Gemini had an advantage.
As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game, such as cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.
Now, Pokémon is a semi-serious AI benchmark at best; few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.
For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.
More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.
Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.