OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

April 21, 2025


A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a fourth of questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next-best model managed to answer only around 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.

Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkEy1B

— Epoch AI (@EpochAIResearch) April 18, 2025

That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today’s release is a materially different system, we are re-labeling our past reported results as “preview”:

o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task

Above uses o1 pro pricing…

— Mike Knoop (@mikeknoop) April 16, 2025

OpenAI’s own Wenda Zhou, a member of the technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed versus the version of o3 demoed in December. As a result, it may exhibit benchmark “disparities,” he added.

“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] more useful in general,” Zhou said. “We still hope that, we still think that, this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.

Updated 4:21 p.m. Pacific: Added comments from Wenda Zhou, a member of the OpenAI technical staff, from a livestream last week.
