Grok 4 is a big leap from Grok 3, however how good is it in comparison with different fashions available in the market, akin to Gemini 2.5 Professional? We now have solutions, due to new impartial benchmarks.
LMArena.ai, which is an open platform for crowdsourced AI benchmarking, has printed the outcomes of Grok 4.
We’re speaking about Grok 4 API (grok-4-0709), which obtained about 4k+ neighborhood votes and ranks #3 general in Textual content Enviornment. This can be a big leap from Grok 3, which ranked eighth.
Based on LMArena’s assessments, Grok 4 scores High-3 throughout all classes (#1 in Math, #2 in Coding, #3 in Arduous Prompts).
Grok 4 was examined with real-world prompts throughout domains like coding, math, in addition to inventive writing, and it carried out rather well:
- Math: #1
- Coding: #2
- Inventive Writing: #2
- Instruction Following: #2
- Arduous Prompts: #3
Nevertheless, it’s price noting that the examined mannequin is Grok 4, not Grok 4 Heavy.
Whereas each are reasoning fashions, Grok 4 Heavy is considerably higher.
The numbers might be totally different with Grok 4 Heavy, which makes use of a number of brokers to assume and evaluate outcomes, however the Grok 4 Heavy mannequin will not be but out there on the API platform.
Gemini 2.5 Professional and Claude nonetheless stay the most effective fashions for coding, however that may change when xAI ships Grok 4 Code in August.
Grok 4 Code is optimised for coding, and we’re additionally anticipating a CLI, much like Gemini CLI and Claude Code.