The ultimate risk of superintelligent AI may not be deliberate betrayal, but getting tangled in an unpredictable mess inside its own labyrinth of logic.
A recent study by Anthropic, EPFL, and the University of Edinburgh examines the relationship between model scale, task complexity, and the risk of losing control.
The researchers found that as reasoning steps increase, AI becomes more likely to exhibit random chaos, which they call incoherence: not awakening and steadfastly executing a wrong goal as in science fiction, but losing itself in its own massive computation.
Intelligent failure breaks down into bias and random variance
We habitually imagine the risks of AI as some kind of premeditated malice.
This is like a driver deliberately steering the car off a cliff: the goal and the trajectory are both clear.
Academia labels this kind of error bias: the model stubbornly pursuing an undesired goal.
The other risk is more like the driver suddenly getting drunk: the wheels wobble left and right, the trajectory is completely irregular, and no logic can predict the next move.
This is the trouble brought by random fluctuation, i.e. variance.
The researchers define incoherence as the degree to which failure is dominated by such random fluctuation.
The decomposition writes total error as squared bias plus variance; incoherence measures the proportion of total error contributed by variance.
When this value approaches 0, the model's errors are highly regular: even when it is wrong, it is wrong in a predictable way. When it approaches 1, the model's behavior is essentially noise.
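To make the decomposition concrete, here is a minimal Python sketch. The function name and the convention of scoring each attempt as an error deviation are illustrative assumptions, not the paper's actual code.

```python
import statistics

def incoherence(errors):
    """Estimate incoherence from error scores across repeated attempts
    at the same task.

    Mean squared error decomposes as (mean error)^2 + variance;
    incoherence is the variance's share of that total.
    """
    mean_err = statistics.mean(errors)   # systematic part (bias)
    var = statistics.pvariance(errors)   # random part (variance)
    total = mean_err**2 + var            # total error
    return var / total if total > 0 else 0.0

# Always wrong in the same way: pure bias, incoherence is 0
print(incoherence([1, 1, 1, 1]))  # -> 0.0
# Right half the time, at random: variance dominates
print(incoherence([0, 1, 0, 1]))  # -> 0.5
```

The two extreme inputs correspond to the two drivers above: the cliff-bound driver (all error is bias) and the drunk driver (all error is noise).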
Current top models exhibit obvious drunken characteristics on complex tasks: the randomness they produce during reasoning far exceeds their systematic bias.
Future safety hazards may therefore look less like the high-IQ deliberate rebellion of sci-fi movies and more like unpredictable industrial accidents.
The paper's overview figure depicts two paths to AI loss of control.
The top-left panel shows drastically different results from resampling the same programming task; the top-right shows the mathematical decomposition of error into bias and variance; the bottom-left reveals that models become more incoherent as task complexity increases; the bottom-right shows the more complicated effect of model scale on incoherence.
Extended thinking time induces systematic logical collapse
The researchers measured model performance on several challenging benchmarks, including GPQA (graduate-level scientific QA) and SWE-bench (a software engineering benchmark).
They found a disturbing trend: the more steps an AI spends thinking and acting, the more inconsistent its performance becomes.
This is like asking a person to do a ten-step mental calculation.
Tiny deviations in the first step amplify along the chain of thought (CoT).
By the last step, the model's answer has often drifted off the logical track. The phenomenon shows up vividly in frontier models such as Sonnet 4 and o3-mini.
By comparing samples above and below the median reasoning length, the authors showed that even on tasks of the same difficulty, longer reasoning paths directly lead to higher incoherence.
Natural overthinking is the culprit of the chaos: even when a long reasoning trace happens to land on the correct answer, the process is full of random detours.
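A crude sketch of that median-split analysis, under the assumption that repeated attempts are logged as (reasoning length, error) pairs; the paper's actual estimator is likely more careful, and the error variance used here is only a proxy for incoherence.

```python
import statistics

def median_split_variance(samples):
    """Split attempts at the median reasoning length and compare the
    error variance of the short-trace and long-trace halves.

    samples: list of (reasoning_length, error) pairs for repeated
    attempts at tasks of comparable difficulty. Assumes both halves
    are non-empty.
    """
    med = statistics.median(length for length, _ in samples)
    short_errs = [e for length, e in samples if length <= med]
    long_errs = [e for length, e in samples if length > med]
    return statistics.pvariance(short_errs), statistics.pvariance(long_errs)

# Synthetic data where long traces are noisier (illustrative only):
samples = [(5, 0), (6, 0), (7, 0), (8, 0), (20, 0), (21, 1), (22, 0), (23, 1)]
print(median_split_variance(samples))  # -> (0, 0.25)
```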
Under the paper's hot mess framing, as intelligent agents grow more capable, their behavior becomes harder to explain in terms of a single goal.
They are no longer pure goal optimizers; in high-dimensional state spaces, they look more like random walkers with no destination.
Scaling up amplifies randomness on complex tasks
Simply stacking more compute and parameters does not seem to cure this logical internal friction.
For simple tasks, larger models do perform more robustly, with incoherence decreasing as scale increases.
But when facing truly challenging problems, the situation reverses.
On the MMLU (Massive Multitask Language Understanding) benchmark, the Qwen3 family shows an interesting evolutionary trajectory.
As parameters grow from 1.7B to 32B, both bias and variance on easy questions shrink: the models become smarter and more reliable at the same time.
On the hardest questions, although the overall error rate of larger models still falls, they reduce bias much faster than they reduce variance.
Larger models tend to reach the truth in an occasionally unstable way, and when they are wrong, they behave more wildly and unpredictably than smaller models.
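The arithmetic behind that claim can be illustrated with hypothetical numbers (not taken from the paper): if a larger model cuts bias far more than it cuts variance, total error falls while the variance share, i.e. incoherence, rises.

```python
def error_and_incoherence(bias, variance):
    """Total error under the bias^2 + variance decomposition,
    and the variance's share of it (incoherence)."""
    total = bias**2 + variance
    return total, variance / total

# Hypothetical numbers, purely for illustration:
small_total, small_inc = error_and_incoherence(bias=0.6, variance=0.2)
large_total, large_inc = error_and_incoherence(bias=0.2, variance=0.15)

# The larger model has lower total error but higher incoherence
print(small_total, small_inc)
print(large_total, large_inc)
```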
This phenomenon was verified in simulated-optimizer experiments.
The researchers trained Transformer models to imitate a mathematical optimization trajectory. The larger the model, the faster it learned the objective function, but its ability to maintain long, coherent action sequences grew comparatively slowly.
Debiasing mechanisms cannot fully eliminate the system's internal friction
Ensembling is considered a painkiller for this chaos.
By having the model attempt the same problem many times and averaging the results, variance shrinks rapidly with the number of attempts.
In tests with o4-mini, doubling the ensemble size shrank the variance proportionally.
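The variance shrinkage that ensembling relies on can be demonstrated with a toy simulation; the setup below (independent attempts, each correct with a fixed probability) is an illustrative assumption, not the paper's o4-mini experiment.

```python
import random

def ensemble_score_variance(n_members, n_trials=20000, p_correct=0.5, seed=0):
    """Variance of an ensemble's mean score, estimated by simulation.

    Each member independently answers correctly with probability
    p_correct; the ensemble's score is the mean of its members' 0/1
    outcomes. Independent averaging shrinks variance roughly as 1/n.
    """
    rng = random.Random(seed)
    scores = [
        sum(rng.random() < p_correct for _ in range(n_members)) / n_members
        for _ in range(n_trials)
    ]
    mean = sum(scores) / n_trials
    return sum((s - mean) ** 2 for s in scores) / n_trials

v1 = ensemble_score_variance(1)
v4 = ensemble_score_variance(4)
print(v1 / v4)  # roughly 4: quadrupling the ensemble cuts variance ~4x
```

Note that this averaging trick attacks only the variance term; the bias term survives any amount of resampling, which is why the decomposition matters.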
But many actions in the real world are irreversible. An AI agent often gets no second attempt when deleting a database, sending an email, or acting physically.
In such one-shot scenarios, the power of ensembling is unavailable, and the model's inherent incoherence becomes a time bomb that could go off at any moment.
Although increasing the reasoning budget can improve accuracy, it cannot fundamentally reverse the rising trend of incoherence.
This points to an uncomfortable fact: the chaos does not stem from insufficient training. It looks more like a native flaw of high-dimensional dynamical systems handling long-range dependencies.
The focus of safety research should shift accordingly.
We should not fixate only on elusive scenarios of deliberate takeover.
The more imminent threat is that a superintelligence entrusted with a complex industrial process or software architecture might, from one minor logical disturbance, instantly create a storm of chaos that humans can neither understand nor intercept.
Future risk control requires a more fine-grained decomposition.
Bias itself can be decomposed into goal misspecification and mesa-bias: the former is humans specifying the goal badly, the latter is the model internalizing a warped goal on its own.
As these biases are gradually optimized away, the stubborn problem of incoherence comes to the fore instead.
If the loss of logical coherence cannot be solved at the level of model architecture, scaling alone will only give us a digital giant that is knowledgeable yet perpetually on the brink of drunkenness.
Rather than worrying that AI has ideas of its own, we should worry that at the critical moment it does not know what it is thinking.
References:
https://arxiv.org/pdf/2601.23045
https://github.com/haeggee/hot-mess-of-ai