A new AI coding challenge just published its first results – and they aren’t pretty

July 24, 2025

A new AI coding challenge has revealed its first winner, and set a new bar for AI-powered software engineers.

On Wednesday at 5pm PST, the nonprofit Laude Institute announced the first winner of the K Prize, a multi-round AI coding challenge launched by Databricks and Perplexity co-founder Andy Konwinski. The winner was a Brazilian prompt engineer named Eduardo Rocha de Andrade, who will receive $50,000 for the prize. But more surprising than the win was his final score: he won with correct answers to just 7.5% of the questions on the test.

“We’re glad we built a benchmark that is actually hard,” said Konwinski. “Benchmarks should be hard if they’re going to matter,” he continued, adding: “Scores would be different if the big labs had entered with their biggest models. But that’s kind of the point. K Prize runs offline with limited compute, so it favors smaller and open models. I love that. It levels the playing field.”

Konwinski has pledged $1 million to the first open-source model that can score higher than 90% on the test.

Similar to the well-known SWE-Bench system, the K Prize tests models against flagged issues from GitHub as a test of how well models can deal with real-world programming problems. But while SWE-Bench is based on a fixed set of problems that models can train against, the K Prize is designed as a “contamination-free version of SWE-Bench,” using a timed entry system to guard against any benchmark-specific training. For round one, models were due by March 12th. The K Prize organizers then built the test using only GitHub issues flagged after that date.
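The timed entry idea comes down to a date cutoff: only issues flagged after the submission deadline can appear on the test. The Python sketch below is an illustration only, not the organizers' actual pipeline. The `Issue` record, the `select_uncontaminated` helper, the sample issues, and the deadline's year and time of day are all assumptions made for the example (the article gives only "March 12th").

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Issue:
    """Hypothetical record for a flagged GitHub issue."""
    repo: str
    number: int
    flagged_at: datetime


def select_uncontaminated(issues, submission_deadline):
    """Keep only issues flagged strictly after the model submission deadline."""
    return [issue for issue in issues if issue.flagged_at > submission_deadline]


# Assumed deadline: the year and time of day are not stated in the article.
DEADLINE = datetime(2025, 3, 12, 23, 59, 59, tzinfo=timezone.utc)

# Made-up sample issues, one flagged before the deadline and one after.
issues = [
    Issue("example/repo", 101, datetime(2025, 2, 20, tzinfo=timezone.utc)),
    Issue("example/repo", 102, datetime(2025, 3, 30, tzinfo=timezone.utc)),
]

test_set = select_uncontaminated(issues, DEADLINE)
print([issue.number for issue in test_set])
```

Because submitted models are frozen before any test issue exists, they cannot have been trained on those specific problems, which is the property a fixed problem set cannot guarantee.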

The 7.5% top score stands in marked contrast to SWE-Bench itself, which currently shows a 75% top score on its easier ‘Verified’ test and 34% on its harder ‘Full’ test. Konwinski still isn’t sure whether the disparity is due to contamination on SWE-Bench or just the challenge of collecting new issues from GitHub, but he expects the K Prize project to answer the question soon.

“As we get more runs of the thing, we’ll have a better sense,” he told TechCrunch, “because we expect people to adapt to the dynamics of competing on this every few months.”

It might seem like an odd place to fall short, given the wide range of AI coding tools already publicly available – but with benchmarks becoming too easy, many critics see projects like the K Prize as a necessary step toward solving AI’s growing evaluation problem.

“I’m quite bullish about building new tests for existing benchmarks,” says Princeton researcher Sayash Kapoor, who put forward a similar idea in a recent paper. “Without such experiments, we can’t actually tell if the issue is contamination, or even just targeting the SWE-Bench leaderboard with a human in the loop.”

For Konwinski, it’s not just a better benchmark, but an open challenge to the rest of the industry. “If you listen to the hype, it’s like we should be seeing AI doctors and AI lawyers and AI software engineers, and that’s just not true,” he says. “If we can’t even get more than 10% on a contamination free SWE-Bench, that’s the reality check for me.”
