GPT-5.2 can score 90% in Python, but switch to a language called Whitespace, and it drops straight to zero.
Not 50%, not 10%, but 0%.
This is the harsh reality revealed by the latest EsoLang-Bench benchmarks.
When the world's top large language models were thrown against "bizarre" programming languages like Brainfuck, Befunge-98, and Unlambda, their IQs collectively flatlined. The best score achieved was a mere 4.2%, equivalent to scoring only 4 points on a 100-point exam.
Yet, on Python, the language they use daily, they can score 90.
A "Malicious" Exam
To be honest, this test is a bit of a bully.
The research team selected five esoteric languages: Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare. The training data available for these languages is 5,000 to 100,000 times scarcer than that for Python.
If Python is a native of the internet, these languages are like handwritten ciphers hidden in a basement.
The testing rules were simple: 80 programming problems ranging from "Hello World" to complex algorithms, divided into four difficulty levels: Easy, Medium, Hard, and Extra-Hard. Each model was evaluated under several prompting strategies.
The result?
All models failed completely on Medium difficulty and above, scoring 0%.
Whitespace maintained an undefeated 0% record across all configurations.
Even the strongest GPT-5.2 achieved an overall accuracy of only 4.2%.
Whitespace: An Invisible Dimensional Strike
The most brutal battlefield was Whitespace.
The syntax of this language consists of only three elements: spaces, tabs, and line breaks. To the naked eye, it looks like a blank page, but the program is hidden within these whitespaces.
Already hostile to human readers, it is an absolute kill switch for AI.
This comes down to tokenization. When a large model's tokenizer processes Python, "print" is one token and "def" is another, which is efficient and elegant. Facing Whitespace, however, a space is just a space; the model cannot "see" the semantics hidden behind those spaces.
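A toy illustration of the point, assuming the usual Whitespace 0.3 encoding in which "push 1" is written as [Space][Space][Space][Tab][LF]. `str.split` is of course nothing like a real BPE tokenizer, but the visibility problem it demonstrates is the same:

```python
# A Whitespace "push 1" instruction, assuming the 0.3 spec encoding:
# [Space] = stack IMP, [Space] = push, [Space] = positive sign,
# [Tab] = binary 1, [LF] = end of number.
ws_push_1 = "   \t\n"
py_line = "print(1)"

print(repr(ws_push_1))     # '   \t\n' -- the whole program is invisible on screen
print(ws_push_1.split())   # []        -- naive whitespace splitting erases it entirely
print(py_line.split("("))  # ['print', '1)'] -- Python keywords survive as clear chunks
```

The same string that carries the entire program vanishes under any processing step that treats whitespace as mere separation.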
It's like asking someone to walk through a maze blindfolded and then describe the color of the walls.
Research data shows that models are completely unable to generate valid code in this language. It's not a logical error; the code simply won't compile. This exposes an awkward fact: AI's so-called "programming ability" may just be advanced imitation of training data.
Error Spectrum: Each Language Mocks a Different Shortcoming
Interestingly, different languages expose different "brain-fart" modes in the models.
On Brainfuck (a minimalist language with only 8 commands), 83.9% of errors were logical errors. The model could write syntactically correct code, but the algorithm was wrong. This indicates it "recognizes" the commands but doesn't understand how to combine them to solve problems.
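To make those 8 commands concrete, here is a minimal Brainfuck interpreter sketch in Python. The 256-wrap-around for cells is one common convention, not part of any official spec; this is an illustration, not the benchmark's harness:

```python
def run_bf(code: str, input_bytes: bytes = b"") -> str:
    """Minimal interpreter for Brainfuck's 8 commands: > < + - . , [ ]"""
    tape = [0] * 30000
    ptr = ip = inp = 0
    out = []
    # Pre-match brackets into a jump table.
    stack, jumps = [], {}
    for i, c in enumerate(code):
        if c == "[":
            stack.append(i)
        elif c == "]":
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    while ip < len(code):
        c = code[ip]
        if c == ">": ptr += 1                          # move cell pointer right
        elif c == "<": ptr -= 1                        # move cell pointer left
        elif c == "+": tape[ptr] = (tape[ptr] + 1) % 256
        elif c == "-": tape[ptr] = (tape[ptr] - 1) % 256
        elif c == ".": out.append(chr(tape[ptr]))      # output current cell
        elif c == ",":                                 # read one input byte (0 at EOF)
            tape[ptr] = input_bytes[inp] if inp < len(input_bytes) else 0
            inp += 1
        elif c == "[" and tape[ptr] == 0: ip = jumps[ip]   # jump past matching ]
        elif c == "]" and tape[ptr] != 0: ip = jumps[ip]   # jump back to matching [
        ip += 1
    return "".join(out)

# Set cell 0 to 8, loop to add 8 to cell 1 eight times (64), then +1 -> 65 = 'A'
print(run_bf("++++++++[>++++++++<-]>+."))   # prints: A
```

The syntax is trivially checkable; the hard part, exactly as the error data shows, is composing these commands into a working algorithm.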
On Unlambda (an oddball of functional programming), 74.6% were compilation errors. The model couldn't even write valid combinator expressions, much like memorizing English words by letter order alone without understanding their meaning.
On Befunge-98 (a 2D grid language), 93.4% were runtime errors, with infinite loops being commonplace.
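A toy interpreter for a tiny Befunge-93-style subset shows why this happens (an assumption for illustration: the benchmark uses the far richer Befunge-98, and this sketch supports only digits, `+`, `*`, `.`, `@`, and the four arrows). The instruction pointer moves through a 2D grid and wraps at the edges, so a misplaced arrow turns the program into an endless circuit:

```python
def run_befunge(src: str, max_steps: int = 10_000) -> str:
    """Toy Befunge-93-style subset: 2D grid, wrapping instruction pointer."""
    grid = [list(line) for line in src.splitlines()]
    x = y = 0
    dx, dy = 1, 0                  # pointer starts at top-left, moving right
    stack, out = [], []
    for _ in range(max_steps):
        c = grid[y][x]
        if c.isdigit(): stack.append(int(c))
        elif c == "+": stack.append(stack.pop() + stack.pop())
        elif c == "*": stack.append(stack.pop() * stack.pop())
        elif c == ".": out.append(f"{stack.pop()} ")   # Befunge prints a trailing space
        elif c == ">": dx, dy = 1, 0
        elif c == "v": dx, dy = 0, 1
        elif c == "<": dx, dy = -1, 0
        elif c == "^": dx, dy = 0, -1
        elif c == "@": return "".join(out)             # halt
        x = (x + dx) % len(grid[y])                    # wrap at grid edges
        y = (y + dy) % len(grid)
    raise RuntimeError("step limit exceeded -- likely an infinite loop")

print(run_befunge("12+.@"))   # prints: 3
```

A two-line program like `">v"` over `"^<"` never reaches a halt instruction and spins forever, which is precisely the failure mode that dominated the models' Befunge-98 runs.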
On Shakespeare (writing code as dramatic dialogue), 59.2% were runtime errors. The model could produce syntax resembling a Shakespearean play, but the logical flow of the dialogue between "Hamlet" and "Ophelia" was a complete mess.
These error distributions act like a medical report, telling us just how fragile the AI's reasoning chain is when there is no "standard answer" to copy.
Self-Reflection? Useless. Multi-Agent? Even Worse.
What's even more surprising is the comparison of strategies.
The research team tried five prompting methods: Zero-shot, Few-shot, Chain of Thought (CoT), Self-Scaffolding, and Multi-Agent systems.
Few-shot prompting showed no significant improvement over Zero-shot (p=0.505). This means that showing the AI a few examples does not help it master a new paradigm in context.
Self-reflection and Multi-Agent systems actually made things worse. Adding a "critic" or "planner" role caused accuracy to drop rather than rise: when every component lacks domain knowledge, additional LLM calls just introduce more noise.
The only effective method was Self-Scaffolding: allowing the model to iterate repeatedly based on error messages from the interpreter. This is like a student trying bit by bit against compiler errors; although clumsy, it is better than blind guessing.
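The loop above can be sketched in a few lines. Everything here is a hypothetical stand-in (`ask_model`, `run_code`, and the toy demo are not the benchmark's actual harness); the point is only the shape of the feedback cycle:

```python
# Sketch of a Self-Scaffolding loop: propose code, run it against the real
# interpreter, and feed any error message back into the next attempt.
def self_scaffold(task, ask_model, run_code, max_rounds=5):
    feedback = None
    for attempt in range(1, max_rounds + 1):
        code = ask_model(task, feedback)   # feedback is None on the first try
        ok, feedback = run_code(code)
        if ok:
            return code, attempt           # success: return code and rounds used
    return None, max_rounds                # gave up: every attempt failed

# Toy demo: a fake "model" that only fixes its syntax error after seeing feedback.
def fake_model(task, feedback):
    return "print(1" if feedback is None else "print(1)"

def check_syntax(code):
    try:
        compile(code, "<candidate>", "exec")
        return True, None
    except SyntaxError as e:
        return False, str(e)               # the error message becomes feedback

code, attempts = self_scaffold("print the number 1", fake_model, check_syntax)
print(code, attempts)   # prints: print(1) 2
```

The interpreter plays the role of the compiler in the student analogy: it supplies the only ground truth the model has access to.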
A Glimmer of Hope for Agentic Systems
However, there is a small plot twist.
When researchers gave the models "hands"—allowing them to call real interpreters to execute code (Agentic mode, such as Codex and Claude Code)—scores doubled.
Codex achieved 13.8% on Brainfuck, the highest single score in the entire benchmark.
This suggests that execution feedback loops can partially compensate for missing training data. The AI still doesn't "understand" these languages, but it can now "trial and error."
Even so, 13.8% is still far from the passing line. And when facing Whitespace, even tools were useless; it remained 0%.
Are We Truly Creating Intelligence, or Just Building Advanced Parrots?
The most stinging revelation of this test is: The high performance of current large models in code generation is likely primarily a function of training data scale, rather than proof of general reasoning capabilities.
The high score in Python is because it was fed billions of code snippets; the zero score in Whitespace is because no one is bored enough to write programs in spaces to feed the AI.
When encountering knowledge that is economically irrational in the training data (who would pay to label Whitespace?), the AI's "understanding" instantly evaporates.
Someone joked in the comments: "I also got 0, does that mean I also rely on memory rather than true reasoning?"
But the difference is, humans can learn after reading the Whitespace documentation; AI, even after seeing countless examples, still scores 0% on Medium difficulty and above.
That is the gap.
[kimi-k2.5 Sharp Comment]: When AI collectively scores zero in front of Whitespace, we finally see it clearly: the so-called programming geniuses are just experts at memorizing past exam questions. Step outside the syllabus, and even the way they hand in a blank paper betrays the poverty of their training data.
Reference Link:
https://esolang-bench.vercel.app/