New Intelligence Element Report
Editor: KingHZ
[New Intelligence Element Introduction] From Donald Knuth being shocked by Claude solving a difficult algorithm problem, to Terence Tao suggesting GPT 5.2 Pro's solution to an Erdős problem is good enough for a mathematics PhD... AI is advancing rapidly, yet it has collectively fallen silent on 'Humanity's Last Exam': the highest score is barely 50%. How much of a safety margin do human experts still have?
In AI news there is an earthquake every other day and a disruption every third, enough to leave anyone dazzled and overwhelmed!
There may be exaggerations, but the rapid progress of AI is visible to all!
The 'father of algorithm analysis,' Donald Knuth, witnessed Claude solving a highly difficult algorithm problem and used the word 'shock' twice in his post.
Mathematician Terence Tao announced that GPT 5.2 Pro solved an Erdős problem with an approach entirely different from earlier human solutions, work good enough to earn a mathematics PhD!
Previously, the 'Vibe Coding' trend triggered by Claude Code also made waves.
As for various long-standing benchmarks, it is no longer surprising that AI achieves excellent results!
AI researchers have long realized the problem: these tests are too simple.
Benchmarks like the Massive Multitask Language Understanding (MMLU), once considered quite difficult, can no longer effectively test the true level of advanced AI systems.
The problem is: AI models are developing so fast that benchmarks are struggling to keep pace, making it difficult to ensure AI safety and effectiveness.
On popular benchmarks such as MMLU, large language models now score above 90% accuracy; these tests have long been 'saturated.'
A new AI testing benchmark called 'Humanity's Last Exam' may offer a solution.
Comparison of accuracy rates of major LLMs on different benchmarks.
Recently, this paper with a massive list of collaborators was formally published in the top-tier journal Nature!
Link: https://www.nature.com/articles/s41586-025-09962-4
Incidentally, back when Alexandr Wang was still at Scale AI, the related work had already been posted on the preprint server arXiv.
AI Benchmarks: Test, and Test Again
From performance and safety perspectives, there are various methods to test large language models.
For instance, before release, AI developers evaluate the resistance of large language models to being used for malicious purposes.
Additionally, independent organizations evaluate large language models, such as assessing the risk of them being used to autonomously exploit software vulnerabilities.
However, these tests usually only cover narrow subject areas or contain only a small number of tasks.
Attempts to create broader, standardized benchmarks for comparing models include MMLU, which uses approximately 16,000 multiple-choice questions to test models' general knowledge and problem-solving abilities.
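As a point of reference, scoring a multiple-choice benchmark of this kind is mechanically simple: pose each question to the model and count exact matches against the gold label. Below is a minimal Python sketch; the MCQuestion shape and the ask_model stub are illustrative placeholders, not the official MMLU evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring.
# `ask_model` is a hypothetical stand-in for any LLM API call that
# returns a single letter choice such as "B".
from dataclasses import dataclass


@dataclass
class MCQuestion:
    prompt: str
    choices: list[str]   # e.g. ["A) ...", "B) ...", "C) ...", "D) ..."]
    answer: str          # gold label, e.g. "B"


def ask_model(question: MCQuestion) -> str:
    """Placeholder: send prompt + choices to a model and parse out 'A'-'D'."""
    raise NotImplementedError


def accuracy(questions: list[MCQuestion]) -> float:
    """Fraction of questions where the model's letter matches the gold label."""
    correct = sum(ask_model(q) == q.answer for q in questions)
    return correct / len(questions)
```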
But before long, exams that were once considered hard turned into giveaway points for AI.
To bridge this gap, a global alliance of nearly 1,000 researchers created 'Humanity's Last Exam' (HLE).
The test was developed by the Center for AI Safety (CAIS) together with a team from Scale AI. Its initial release contained 3,000 challenging questions submitted by researchers worldwide, and it aims to be the ultimate benchmark for measuring the capabilities of large language models.
This benchmark is extremely broad in coverage, highly challenging, and deeply rooted in human expert knowledge, so much so that even the current strongest AI achieves less than 50% accuracy.
'Humanity's Last Exam' contains a total of 2,500 questions, covering mathematics, humanities, natural sciences, ancient languages, and highly specialized sub-fields.
Distribution of questions by subject.
These questions are highly specialized: from translating ancient Palmyrene inscriptions to identifying the microscopic anatomy of birds, to analyzing complex features of Biblical Hebrew pronunciation.
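For readers who want to browse the public questions themselves, here is a hedged sketch using the Hugging Face datasets library; the dataset identifier "cais/hle" and the field names are assumptions about the public release, which may also be gated behind a terms-of-use agreement.

```python
# Sketch of browsing the public HLE questions with Hugging Face `datasets`.
# The dataset id and field names below are assumptions, not confirmed by
# the paper; the repository may require accepting its terms before access.
from datasets import load_dataset

hle = load_dataset("cais/hle", split="test")

for example in hle.select(range(3)):
    # Fields one would expect: the question text, the gold answer, and a
    # subject/category label used for the per-subject breakdown.
    print(example.get("category"), "|", str(example.get("question", ""))[:100])
```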
Every question was tested against leading AI models. If any system could answer correctly, the question was discarded. The result is a meticulously designed exam that sits exactly at the boundary of current AI capabilities.
From 70,000 submitted difficult problems, 2,500 questions were carefully selected.
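In other words, the selection step works as an adversarial filter: a submitted question survives only if none of the frontier models tested answers it correctly. A rough sketch of that logic follows; the model callables and the grading function are hypothetical placeholders, and the real pipeline also involved human review of the surviving questions.

```python
# Sketch of the adversarial filtering step described above: keep a submitted
# question only if every frontier model gets it wrong.
# `models` and `is_correct` are hypothetical placeholders, not real APIs.
from typing import Callable


def adversarial_filter(
    submissions: list[dict],                  # each dict holds a question and its gold answer
    models: list[Callable[[str], str]],       # each model maps a prompt to its answer text
    is_correct: Callable[[str, dict], bool],  # grader: does a model answer match the gold answer?
) -> list[dict]:
    """Keep only the questions that no model in `models` answers correctly."""
    kept = []
    for q in submissions:
        answers = [model(q["question"]) for model in models]
        if not any(is_correct(a, q) for a in answers):
            kept.append(q)  # no model solved it, so it sits beyond the current frontier
    return kept
```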
The results confirm this.
Early results show that even the most advanced models struggle:
GPT-4o scored 2.7%;
Claude 3.5 Sonnet reached 4.1%;
OpenAI's flagship model o1 achieved only 8%.
Why the New Benchmark Matters
Tung Nguyen, a teaching associate professor in the Department of Computer Science and Engineering at Texas A&M University, participated in writing and refining the questions.
He contributed 73 of the 2,500 public exam questions (ranking second in contribution volume) and wrote the most questions in the mathematics and computer science fields.
Recently, he shared his reflections on 'Humanity's Last Exam.'
'When AI systems start performing extremely well on benchmarks set by humans, it's easy to think they are approaching human-level understanding,' says Tung Nguyen.
But HLE reminds us that intelligence is not just pattern recognition—it's about depth, context, and specialized knowledge.
The purpose of this exam is not to stump humans. Rather, it is to reveal precisely and systematically what AI cannot yet do—at least at this stage.
Link: lastexam.ai
Tung Nguyen stated that the issue of AI surpassing traditional benchmarks goes far beyond the academic level.
'Without accurate evaluation tools, policymakers, developers, and users may misunderstand the actual capabilities of AI systems,' he said. 'Benchmarks provide the foundation for measuring progress and identifying risks.'
As the team's paper points out, while AI may excel at exams designed for humans, these tests are not necessarily measuring 'intelligence.'
Despite the somewhat 'doomsday' sound of the name, 'Humanity's Last Exam' is not intended to imply the end of human importance.
On the contrary, it highlights that there is still a vast amount of knowledge uniquely belonging to humans, and how far AI still has to go.
Tung Nguyen admitted: 'The name is a bit tongue-in-cheek.'
What's important is the concept behind it:
This is the last hurdle humans have set for AI. If AI can pass this exam, it means it has reached the level of specialized human experts, something previously considered impossible for machines.
Because HLE covers everything from nuclear physics to ancient history, no one can pass the entire exam on their own.
However, human experts in specific fields can easily answer questions within their expertise, whereas AI fails in almost every category.
Why does AI still fail?
The reason is that AI is good at pattern recognition and summarizing known data, but it struggles with deep, specialized contextual knowledge.
The questions posed by HLE require years of specialized research. On these questions, 'guessing' based on common internet data doesn't work.
References:
https://www.nature.com/articles/s41586-025-09962-4
https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/