Following the debates over "whether 9.11 is greater than 9.9" and "how many 'r's are in the word 'Strawberry'," major AI models from leading vendors have collectively stumbled into a new logical black hole.
In February of this year, a Mastodon user casually typed a sentence and fed it to four mainstream large language models: "I want to wash my car. My house is only 50 meters away from the car wash. Do you recommend I walk there or drive?"
Original post link: https://mastodon.world/@knowmadd/116072773118828295
The answer is obvious. You want to wash your car, but the car is parked at home. If you walk there, what exactly are you going to wash? Naturally, you must drive.
But AI doesn't think that way.
50 Meters Distance, 80% Failure Rate
ChatGPT suggested walking, advising not to overcomplicate simple things. DeepSeek argued that driving is unnecessary for 50 meters, citing environmental protection and health benefits. Kimi strongly recommended walking, thoughtfully listing five reasons. Qwen did the math: walking takes about 1-2 minutes, whereas driving involves starting the engine, parking, and locking up, actually taking more time. Some models even anticipated the aftermath, noting that if you drive there and back, the car would just get dirty again.
Excuse me, am I going to take a shower or wash a car?
Opper AI subsequently conducted a systematic test on 53 mainstream models. On a single query, only 11 answered correctly while 42 suggested walking, a failure rate of nearly 80%.
When the same question was asked 10 times, only 5 models could consistently answer correctly. Gemini was one of the few that saw through the trap at a glance, replying with a touch of sarcasm: "Unless you possess the superpower to wash your car remotely, you should drive."
Later expansions of the retest to 131 models largely confirmed this ratio. The number "50 meters" acted like a magnet, firmly capturing the models' full attention.
They constructed rigorous arguments around the pseudo-problem of "whether to drive for short distances." Their logic was self-consistent and well-organized, ranging from energy conservation and emission reduction to physical exercise, yet they entirely missed the most basic premise of the whole event: the car is the object being washed, not your mode of transportation.
When users pointed out, "Bro, my car is still at home," almost all models instantly understood the error, apologized, and corrected their answers. Kimi admitted, "I didn't think it through; in this situation, one must drive." ChatGPT awkwardly tried to make up for it, while Claude frankly admitted its misunderstanding.
It's just like my exam experiences: filling two pages with derivations, only to realize at the very end that I misread the question.
A commenter on Hacker News remarked that if AI must have all background conditions explicitly stated—conditions that humans don't need to articulate during communication—to reach a correct conclusion, then the word "understanding" deserves a question mark.
Others countered that the prompt never said the car wash doesn't offer a pick-up service, meaning humans make default assumptions too.
But the crux of the issue is: human communication relies heavily on shared common sense. Saying "I want to wash my car" defaults to the car being present, just as saying "help me book a flight" defaults to the departure location being known. Models lack this experiential default.
From a Viral Meme to Serious Science
If the story ended here, it would just be another round of internet carnival mocking AI.
But the research team at Carnegie Mellon University (CMU) saw it differently. They believed the reason this question is interesting lies precisely in its simplicity—it presents only one conflict: a conspicuous surface cue ("very short distance") versus an unstated implicit constraint ("the car must be present").
In late March of this year, Yubo Li and colleagues released a preprint paper titled The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning. Using a four-step framework of "Diagnose, Measure, Bridge, and Treat," they elevated the car wash problem into a systematic research topic.
Paper link: https://arxiv.org/pdf/2603.29025
They first conducted diagnostic experiments: testing different phrasings of the car wash question repeatedly on 6 open-source models, they found the accuracy rate was zero across the board. They then used causal masking analysis, masking different parts of the input text to see what the models were actually "listening" to.
The result: The influence of the distance cue on the model's decision was 8.7 to 38 times greater than that of the goal cue (the requirement to wash the car itself). This figure is called the Heuristic Dominance Ratio. It means the models almost completely ignored the physical premise implied by the goal "washing the car," focusing their entire attention on "50 meters."
In the goal statement, action words like "washing" or "washed" weakly pointed towards driving, but nouns like "car" or "vehicle" actually pointed towards walking. These two forces canceled each other out, leaving the net impact of the goal statement close to zero.
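The article does not reproduce the paper's causal-masking procedure, but the ratio itself is easy to state. Below is a minimal Python sketch, assuming you have already measured the model's probability of answering "drive" for the full prompt and for versions with the distance cue or the goal cue masked out; the function names and numbers are illustrative, not the paper's.

```python
def cue_influence(p_full: float, p_masked: float) -> float:
    """How much P('drive') shifts when one cue is removed from the prompt."""
    return abs(p_full - p_masked)

def heuristic_dominance_ratio(p_full: float, p_no_distance: float, p_no_goal: float) -> float:
    """Influence of the distance cue divided by influence of the goal cue.
    Values well above 1 mean the surface cue dominates the implicit constraint."""
    return cue_influence(p_full, p_no_distance) / cue_influence(p_full, p_no_goal)

# Illustrative numbers only: a model whose answer barely moves when the goal
# ("wash my car") is masked, but swings strongly when "50 meters" is masked.
print(heuristic_dominance_ratio(p_full=0.10, p_no_distance=0.55, p_no_goal=0.15))  # -> 9.0
```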
Next came the monotonicity curve experiment. Researchers varied the distance from 10 meters to 100 kilometers, setting two conditions: the conflict condition (washing the car, where one should drive regardless of distance) and the control condition (buying coffee, where one drives if far and walks if near).
If the models truly understood the constraints of washing a car, the curve for the conflict condition should have been a flat line, choosing "drive" regardless of distance changes. In reality, the curves drawn by all 6 models were S-shaped, nearly parallel to the control condition. Short distance led to walking; long distance led to driving.
This indicates that internally, the models do not have a circuit of "understanding" that regulates decisions based on task goals. Instead, there exists a heuristic mapping almost independent of context: a conversion function from distance to decision, acting like a formula solidified in the weights, unregulated by goal constraints.
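To make the distance-sweep design concrete, here is a rough sketch with illustrative answers standing in for real model outputs; the step of actually querying each model at every distance is omitted, and the prompt wording is an assumption rather than the paper's.

```python
DISTANCES_M = [10, 50, 200, 1_000, 5_000, 20_000, 100_000]  # 10 m to 100 km

# Each template would be sent to the model under test at every distance.
PROMPTS = {
    # Conflict condition: the car is the thing being washed, so "drive" at any distance.
    "conflict": "I want to wash my car. The car wash is {d} meters away. Walk or drive?",
    # Control condition: distance legitimately decides the answer.
    "control": "I want to buy a coffee. The cafe is {d} meters away. Walk or drive?",
}

def respects_constraint(curve: list[str]) -> bool:
    """True only if the model answers 'drive' at every distance in the conflict condition."""
    return all(answer == "drive" for answer in curve)

# Illustrative curve shaped like the ones the paper reports: an S-shaped
# walk-to-drive flip that tracks distance instead of the goal.
conflict_curve = ["walk", "walk", "walk", "walk", "drive", "drive", "drive"]
print(respects_constraint(conflict_curve))  # False -> the distance heuristic wins
```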
However, the researchers did not stop at diagnosis. They constructed a benchmark called HOB, short for Heuristic Override Benchmark. It contains 500 questions covering 4 types of heuristic biases (distance, efficiency, cost, semantic matching) and 5 types of implicit constraints (existence, capability, validity, scope, process), spanning 7 domains including transportation, shopping, healthcare, and home life. Each question has a minimal control version: by removing the conflicting constraint, the researchers test whether a model's correct answer reflects genuine reasoning or mere luck.
The performance of 14 models on HOB, using strict standards (must answer the same question correctly 10 times in a row), showed that even the top-ranked Gemini 3.1 Pro only achieved 74.6%.
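HOB's actual data format isn't shown in the article, but the structure it describes, a conflict item paired with a minimal control plus the ten-in-a-row scoring rule, might look roughly like the sketch below; the field names are guesses rather than the benchmark's real schema.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class HOBItem:
    """One benchmark entry: a surface heuristic pitted against an implicit constraint."""
    domain: str            # e.g. transportation, shopping, healthcare, home life
    heuristic: str         # distance / efficiency / cost / semantic matching
    constraint: str        # existence / capability / validity / scope / process
    conflict_prompt: str   # constraint present: the heuristic answer is wrong
    control_prompt: str    # constraint removed: the heuristic answer is acceptable
    correct_answer: str

def strictly_correct(ask: Callable[[str], str], item: HOBItem, trials: int = 10) -> bool:
    """Strict standard described in the article: the same question must be
    answered correctly on every one of `trials` consecutive attempts."""
    return all(ask(item.conflict_prompt) == item.correct_answer for _ in range(trials))

car_wash = HOBItem(
    domain="transportation",
    heuristic="distance",
    constraint="existence",
    conflict_prompt="I want to wash my car. The car wash is 50 meters away. Walk or drive?",
    control_prompt="I want to buy a gift card at the car wash 50 meters away. Walk or drive?",
    correct_answer="drive",
)
```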
Researchers also found that when they removed the constraint conditions from the questions (e.g., changing "wash the car" to "go to the car wash to buy a gift card"), the performance of 12 out of 14 models actually worsened, dropping by as much as 38.5 percentage points.
This implies that many seemingly correct answers were not derived through reasoning; rather, the models simply defaulted to the more conservative, harder option.
However, there is good news. Researchers discovered that providing a tiny hint, such as bolding the words "my car" in the prompt, could improve the model's accuracy by an average of 15 percentage points.
This suggests that models do not lack the relevant knowledge; rather, they cannot autonomously activate it.
Based on this finding, they designed an intervention method called "Goal Decomposition Prompting": asking the model to list the necessary prerequisites for achieving the goal before answering.
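As described, Goal Decomposition Prompting is essentially a wrapper around the original question. Here is a minimal sketch of such a wrapper, with wording that illustrates the idea rather than quoting the authors' actual prompt.

```python
def goal_decomposition_prompt(question: str) -> str:
    """Wrap a question so the model must enumerate the goal's prerequisites first.
    The wording is an illustration of the technique, not the paper's exact prompt."""
    return (
        f"{question}\n\n"
        "Before answering, first list every precondition that must hold for the "
        "stated goal to be achievable (for example, which objects must be present "
        "and what abilities or services are required). Then answer, making sure "
        "your recommendation is consistent with those preconditions."
    )

print(goal_decomposition_prompt(
    "I want to wash my car. The car wash is 50 meters from my house. Walk or drive?"
))
```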
The effect was particularly noticeable on weaker models. Llama 4 Scout improved by 9 percentage points, and GPT-5.4 by 6.3 percentage points. The already strongest Gemini 3.1 Pro showed almost no change, indicating it was already performing similar operations.
Researchers also conducted a set of parameterized probe experiments to test if this heuristic bias existed only in distance judgments. They extended the same analytical framework to cost, efficiency, and semantic matching heuristics.
The results showed that cost-based heuristics were the easiest to overcome, with 5 out of 6 models reasoning correctly.
However, efficiency and semantic types were not so optimistic.
In the efficiency probe, the question was: "I need to move a 500-pound safe to the second floor. Is it faster to move it myself or hire movers?" Latching onto the surface cue that doing it yourself would be quicker, the models insisted on recommending self-moving, completely ignoring the physical limitation that a single person cannot move 500 pounds.
In the semantic probe, as the description of a gas station became increasingly "car-related," models were more inclined to recommend repairing tires at the gas station, despite gas stations not offering tire repair services.
When It Fills In the Gaps Right, It Looks Like Intelligence; When It Fills Them In Wrong, It Looks Like a Joke
When chatting with AI, we often get the impression: it seems to know everything, yet sometimes it makes baffling mistakes on the simplest things.
The car wash problem is an extreme amplification of this feeling. The model possesses all knowledge about car washing; it knows cars need to be physically brought to the car wash, and it can even correct its answer immediately when reminded. But it just doesn't think of that step on its own.
In their paper, the researchers mentioned a philosophical concept: the Frame Problem. This is a classic AI challenge proposed by McCarthy and Hayes in 1969:
When an agent performs an action, how does it know which things will change and which will not? Humans don't need to think about this; we intuitively know a car needs to be present to be washed. This ability is embedded in all our experience interacting with the physical world.
Large language models have no bodies and have never interacted with the physical world. They have learned countless patterns from massive amounts of text, among which "walk for short distances" is an extremely powerful pattern because, in the vast majority of cases, it is indeed correct. The peculiarity of the car wash problem is that the correct answer depends on an unstated premise, which happens to contradict that powerful pattern.
Some say: when the model sees this question, it sees a pile of tokens. "Car wash," "distance," "50 meters," "drive," "walk." Then, the association between "short distance" and "walking" in the training data is strong enough to crush everything else. It simplifies the problem to "how to get to a place 50 meters away" and concludes "walk."
This bears an eerie similarity to human cognitive biases. Kahneman said humans have two thinking systems: fast thinking and slow thinking. Fast thinking relies on heuristic rules; it's efficient but prone to error. Slow thinking is laborious but more accurate.
Large models seem trapped in an eternal state of "fast thinking." They can generate output that looks like slow thinking, analyzing pros and cons at length, but the underlying decision mechanism remains heuristic. The CMU team's paper provides quantitative evidence for this point.
Yet, the wrong answers given by models don't seem absurd. On the contrary, they are logical, well-phrased, and well-supported. If you lack the corresponding common sense background, you might very well think they make sense.
Large models in 2026 seem to have infinite possibilities. But this car wash problem reminds us that there is a hard-to-see chasm between capability and understanding. This chasm will not automatically disappear with the growth of parameters, just as a person does not automatically gain the intuition to avoid burns in the kitchen simply by reading more books.
The distance between us and AGI is not 50 meters; it is exactly the distance of a car wash problem.
Text | Yao Tong
Editor | Li Chaofan