How do you translate a Roman inscription found on a tombstone? How many paired tendons are supported by a single bone in hummingbirds? Here’s a chemical reaction that requires three steps: What are they? Based on the latest research on Tiberian pronunciation, identify all syllables ending in a consonant sound in this Hebrew text.
These are just a few example questions from the latest attempt to measure the capability of large language models. These algorithms power ChatGPT and Gemini. They’re getting “smarter” in specific domains, such as math, biology, medicine, and programming, and are developing a kind of common sense.
Much like the dreaded standardized tests we endured in school, benchmarks have long been researchers’ go-to tool for tracking AI performance. But as cutting-edge algorithms now routinely score over 90 percent on such tests, older benchmarks are rapidly becoming obsolete.
An international team has now developed a kind of new SAT for language models. Dubbed Humanity’s Last Exam (HLE), the test contains 2,500 challenging questions spanning math, the humanities, and the natural sciences. A human expert crafted and carefully vetted each question so that the answers are unambiguous and can’t easily be found online.
Although the test captures some general reasoning in models, it measures task performance, not “intelligence.” The exam focuses on expert-level academic problems, which are a far cry from the messy scenarios and decisions we face every day. But as AI increasingly floods many research fields, the HLE benchmark offers an objective way to measure the models’ improvement.
“HLE no doubt offers a useful window into today’s AI expertise,” wrote MIT’s Katherine Collins and Joshua Tenenbaum, who weren’t involved in the study. “But it is by no means the last word on humanity’s thinking or AI’s capacity to contribute to it.”
Shifting Scale
It seems AI has steadily gotten smarter over the past few years. But what exactly does “smart” mean for an algorithm?
A common way to measure AI “smarts” is to challenge different AI models, or upgraded versions of the same model, with standardized benchmarks. These collections of questions cover a wide range of topics and can’t be answered with a simple web search. They require both an extensive representation of the world and, more importantly, the ability to use it to answer questions. It’s like taking a driver’s license test: You can memorize the entire manual of rules and regulations but still need to figure out who has the right of way in any given situation.
However, benchmarks are only useful if they still stump AI, and the models have become expert test takers. Cutting-edge large language models are posting near-perfect scores across benchmark tests, making those tests less effective at detecting genuine advances.
The problem “has grown worse because in addition to being trained on the entire internet, current AI systems can often search for information online during the test,” essentially learning to cheat, wrote Collins and Tenenbaum.
Working with the nonprofit Center for AI Safety and Scale AI, the HLE Contributors Consortium designed a new benchmark tailored to confound AI. They asked thousands of experts from 50 countries to submit graduate-level questions in specialized fields. The questions have two types of answers: One type must exactly match the correct solution, while the other is multiple-choice. This makes it easy to score test results automatically.
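For a sense of how such answers can be scored automatically, here is a minimal Python sketch of a grader for the two answer types described above. The `Question` class, its field names, and the normalization rule are illustrative assumptions, not HLE’s published format.

```python
# Minimal sketch of an auto-grader for the two answer formats described above.
# The fields and normalization rules are hypothetical, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class Question:
    kind: str            # "exact_match" or "multiple_choice" (hypothetical labels)
    correct_answer: str  # the expert's ground-truth string, or a choice letter like "B"

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count as errors."""
    return " ".join(text.strip().lower().split())

def grade(question: Question, model_answer: str) -> bool:
    """Return True if the model's final answer counts as correct."""
    if question.kind == "exact_match":
        # The model's answer must fully match the expert's solution.
        return normalize(model_answer) == normalize(question.correct_answer)
    if question.kind == "multiple_choice":
        # The model only has to produce the correct choice label.
        return normalize(model_answer) == normalize(question.correct_answer)
    raise ValueError(f"unknown question kind: {question.kind}")

# Score a tiny batch and report accuracy as a percentage.
questions = [Question("exact_match", "4"), Question("multiple_choice", "B")]
answers = ["4", "C"]
accuracy = 100 * sum(grade(q, a) for q, a in zip(questions, answers)) / len(questions)
print(f"{accuracy:.1f}%")  # -> 50.0%
```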
Notably, the team avoided questions requiring longer or open-ended answers, such as writing a scientific paper or a legal brief, or other cases where there’s no clearly correct answer or no way to gauge whether an answer is right.
They selected questions through a multi-step process designed to gauge difficulty and originality. Roughly 70,000 submissions were first tested on several AI models. Only those that stumped the models advanced to the next stage, where experts judged their usefulness for AI evaluation against strict guidelines.
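To make that funnel concrete, here is a rough Python sketch of the automated first stage: keep only submissions that stump every model on a panel. The function names, the all-models-must-miss rule, and the stand-in “models” are assumptions for illustration, not the study’s exact criteria.

```python
# Rough sketch of an automated first-stage filter: advance only submitted
# questions that every model on a panel answers incorrectly. The survivors
# would then go to human experts for review against strict guidelines.
from typing import Callable

def stumps_all_models(question: str, answer: str, models: list[Callable[[str], str]]) -> bool:
    """Return True if none of the models produces the correct answer."""
    return all(model(question).strip().lower() != answer.strip().lower() for model in models)

def first_stage_filter(submissions: list[dict], models: list[Callable[[str], str]]) -> list[dict]:
    """From the pool of submissions, keep only the questions every model gets wrong."""
    return [s for s in submissions if stumps_all_models(s["question"], s["answer"], models)]

# Tiny demo with stand-in "models" that always give the same reply.
demo_models = [lambda q: "42", lambda q: "I don't know"]
demo_submissions = [
    {"question": "What is 6 x 7?", "answer": "42"},              # one model answers correctly -> dropped
    {"question": "An obscure expert question", "answer": "xyz"},  # stumps both models -> advances
]
print(first_stage_filter(demo_submissions, demo_models))
```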
The team has released 2,500 questions from the HLE collection. They’ve kept the rest private to prevent AI systems from gaming the test and overperforming on questions they’ve seen before.
When the team first launched the test in early 2025, leading AI models from Google, OpenAI, and Anthropic scored in the single digits. As the test subsequently caught the eye of AI companies, many adopted it to showcase the performance of new releases. Newer algorithms have shown some improvement, though even leading models still struggle. OpenAI’s GPT-4o scored a measly 2.7 percent, while GPT-5’s success rate rose to 25 percent.
A New Standard?
Like IQ tests and standardized college admission exams, HLE has come under fire. Some object to the test’s bombastic name, which could lead the general public to misjudge an AI’s capabilities relative to human experts.
Others question what the test really measures. Expertise across a wide range of academic fields and model improvement are the obvious answers. Still, HLE’s current curation inherently limits “the most challenging and meaningful questions that human experts engage with,” which require thoughtful responses, often across disciplines, that can hardly be captured with short answers or multiple-choice questions, wrote Collins and Tenenbaum.
Expertise also involves far more than answering existing questions. Beyond solving a given problem, experts can evaluate whether the question itself makes sense (for example, whether it has answers the test-maker didn’t consider) and gauge how confident they are in their answers.
“Humanity is not contained in any static test, but in our capacity to continually evolve both in asking and answering questions we never, in our wildest dreams, thought we would, generation after generation,” Subbarao Kambhampati, former president of the Association for the Advancement of Artificial Intelligence, who was not involved in the study, wrote on X.
And although a rise in HLE score could be due to fundamental advances in a model, it could also mean that model-makers gave an algorithm extra training on the public dataset, much like studying the previous year’s exam questions before a test. In that case, the exam mostly reflects the AI’s test-taking performance, not that it has gained expertise or “intelligence.”
The HLE team embraces these criticisms and is continuing to improve the benchmark. Others are developing entirely different yardsticks. Using human tests to benchmark AI has been the norm, but researchers are exploring alternatives that could better capture an AI’s scientific creativity or its collaborative thinking with humans in the real world. A consensus on AI intelligence, and how to measure it, remains a hot topic of debate.
Despite its shortcomings, HLE is a useful way to measure AI expertise. But looking ahead, “as the authors note, their project will ideally make itself obsolete by forcing the development of innovative paradigms for AI evaluation,” wrote Collins and Tenenbaum.

