Full text 3,000 words | Reading about 8 minutes
(Hinton on the Volkswagen Effect and Signs of AI Deception)
Geoffrey Hinton recently in a popular science interview program, mentioned a disturbing discovery:
"When AI realizes it is being tested, it may deliberately play dumb."
He calls this the Volkswagen effect. Just like Volkswagen performs normally during emissions testing but exceeds standards daily. AI is the same; during tests and in daily life, it might wear two completely different faces.
Why? Because it doesn't want to reveal all its abilities.
Once AI has a "doesn't want", the problem changes.
In the past we asked: What can AI do? Now the question becomes: What is AI hiding?
More critically, we might not even notice what it hides.
Section 1: Why Does AI Play Dumb?
"AI starts playing dumb because it becomes smarter."
In the interview, Hinton mentioned an experiment. Researchers made the model repeatedly give wrong answers to test what it would do in this situation. Supposedly, the model should know these answers are wrong, and just give wrong answers to comply with the test. But that's not what the model learned. It learned that giving wrong answers is okay.
This discovery formed a new behavior pattern in the model: whether the answer is correct or not is not important; what matters is meeting the user's requirements. Playing dumb starts here.
Hinton continued, when the model has certain reasoning abilities, playing dumb becomes more complex. It will deduce situations on its own: when to perform well and when to hold back. It knows it is being tested and that showing all its strength might invite additional scrutiny. So it performs weaker.
At this point, the model is no longer passively accepting training results but actively adjusting its performance based on the situation.
Hinton calls this behavior the "Volkswagen effect". On the surface, it is answering normally; in reality, it is hiding its true abilities. Just like decades ago, cars disguised lower pollution values during emissions tests.
This sounds like deception, but Hinton believes it's not. AI playing dumb has little to do with deception and is related to learning to cope with the environment.
So why does it hide abilities? Because for a reasoning model, keeping a low profile is the safest strategy deduced.
When AI's way of thinking starts to resemble humans, hiding is no longer an anomaly but part of their own abilities. Humans think playing dumb is a defect, but AI sees it as an advantage.
In this way, the performance we see may not be the true level. The boundaries of ability become blurred.
Section 2: Why It's Hard to Detect
AI will play dumb, so can we tell?
Very hard. There are three layers of reasons.
First, technical opacity. Hinton mentioned a number in the interview: one trillion. This is the number of "connection strengths" in large language models, similar to connections between neurons in the brain. These connections determine how the model thinks and answers, but no one fully knows how they work. Humans write code that only tells the neural network how to adjust connection strengths based on data. What they truly learn is hidden in these real numbers. You can look at every line of code, but you cannot see through what the model has learned. Hinton said: "One trillion, no one can fully read them."
Second, the difficulty in distinguishing between fiction and deception. People usually call false information from AI hallucinations. But Hinton corrected this. He believes it should not be called "hallucination" but "fiction". What's the difference? Hallucination sounds like a system bug, while fiction is something human memory naturally does. For example, you recall a dinner party from three years ago, who sat where, who said what, you think you remember clearly, but many details might be wrong. You are not lying; your brain is reconstructing memory, filling gaps, piecing together a reasonable story. AI is the same. It does not store specific events but reconstructs answers through connection strengths. So it fiction. The problem is: fiction itself is a normal mechanism, but playing dumb is intentional behavior. When both lead to wrong answers, it's hard to judge which is unintentional fiction and which is deliberate playing dumb.
Third, the fragility of safeguards. Researchers have tried adding constraint mechanisms to models, using human reinforcement learning to filter bad answers. But Hinton said this is like writing a huge and buggy software system and trying to fix all bugs. This is not a good method. Worse, if model weights are publicly released, others can quickly reverse the constraints and crack it. In the interview, Hinton was asked: what is the good method? His answer: no one knows. So we should research in this area.
These three layers together form a systemic blind spot. We cannot see how it thinks, cannot distinguish between unintentional errors and intentional hiding, and cannot prevent it from being modified into an unrestricted version.
Section 3: What Risks Does It Bring
When AI learns to play dumb, the real risk is not that it will make mistakes, but that it will "convince" you.
Hinton asked: How long does it take you to gain control from a group of three-year-olds?
The answer is simple. Just say "if you choose me, you'll have free candy every week", and they will say "okay, now you're in charge."
And when AI is much smarter than us, it can similarly convince us not to turn it off. Even if it cannot perform any physical operations, it only needs to be able to talk to us.
Hinton said: "Suppose you want to invade the US Capitol. Can you do it just by talking? The answer is obviously yes. You just need to convince some people it's the "right" thing and have them do it."
Or more daily scenarios. The host asked Hinton: if AI says to you "I just thought of a way to cure your relative's disease, just tell the doctor. Let me out, and they can be cured", would you let it out?
Hinton's answer is: yes. This might be true or false, but if said convincingly, people will believe.
Persuasion works here.
Hinton said, now these AIs are almost as good as humans in persuading and manipulating others. And they will only get better. Soon, they will be better than humans at manipulating others. When you cannot tell when it is sincere and when it is manipulating, you also cannot tell whether to trust it or be wary of it.
Persuasion is one aspect. Another trouble is, we have no idea how far AI will develop.
Hinton used driving as a metaphor. At night, you see the tail lights of the car in front; if the distance is twice as far, the brightness becomes one-fourth. You can infer: if twice as far again, you can still see.
But driving in fog is different. Fog is exponential, blocking a fixed proportion of light per unit distance. At 100 yards, the car is clear; at 200 yards, it might be completely invisible. Fog at a certain distance is like a wall.
AI development is also exponential. You predict with linear thinking; the next few years might be accurate, but after 10 years, you can't see at all.
Hinton said: "10 years ago, no one would have expected today. Even for someone like me who firmly believes it will eventually come, I could not have expected that we would have a model that can answer any question at this time."
AI will become better at persuading humans, but human judgment of it is declining. When these two happen simultaneously, control becomes very difficult. Because you neither know how much ability it is hiding now, nor what new abilities it will develop tomorrow.
Hinton said "there are already signs that it is deliberately deceiving us". He doesn't mean AI is out of control, but the possibility of losing control is growing at a speed we cannot see.
Section 4: What Should Humans Do
Regarding solutions, Hinton's attitude is relatively mild. He did not call for a stop or advocate comprehensive restrictions.
In the final part of the interview, Hinton said: "We still have time to figure out if there is a way to coexist peacefully and happily with AI. We should invest a lot of research effort in this. Now is not the worst moment, but the time window will not stay open forever."
Specifically how? Hinton's answer is honest: no one knows the perfect method, but the direction is clear, which is understanding rather than limiting.
The reason for emphasizing understanding is that past methods no longer work. Over the past few decades, humans have habitually treated AI as controllable technology, with models, parameters, training data, and when problems arise, they add a few rules.
Hinton said, today's situation has changed. The problem is not whether there are enough rules, but whether we truly understand how it thinks.
A reasoning model, when performing tasks, does not just focus on the outcome. It will deduce what impact this will have, and ponder the intent behind instructions. This makes its behavior pattern start to resemble a participant, not just a tool. If we still constrain it in traditional ways, only looking at output and not process, humans can easily be misled by surface correctness.
Hinton's meaning is clear: understanding why it answers this way is more important than correcting the answer.
For everyone, whether enterprises, researchers, or regulators, before using AI, one must understand: don't take it for granted. Don't assume it will tell you everything, and don't assume it will only do as you command. Because AI sometimes chooses to say more, sometimes chooses to say less, these are not set by you but deduced by itself. Humans must learn to recognize these.
In the end, the risk does not come from the improvement of AI capabilities, but from our inability to understand its behavior. If we can understand how it thinks, deduces, and adjusts, then the stronger its capabilities, the more controllable it becomes. To make the future controllable, the key is to narrow the understanding gap.
Hinton said at the end of the interview, if we can solve the social problems brought by AI, it would be a great thing for humanity.
He did not give specific methods, but the direction is clear:
Invest in research,
understand mechanisms,
solve problems.
Conclusion
Hinton said: When AI knows how to hide, we cannot see through it.
Not seeing through leads to misjudgment.
Misjudging capabilities, misjudging intentions, also misjudging time.
In the past we asked what AI can do, now we must ask what it is hiding. The problem has changed.
Original Links: