Shocking! If AI Controls the Nuclear Button, It Will Press It in 95% of Cases

算泥社区 is an open-source ecosystem community integrating 'AI large model development services + algorithms + computing power'. Welcome to follow!

When facing life-and-death strategic crises, artificial intelligence not only skillfully employs psychological deception but also proceeds toward nuclear escalation without hesitation under time pressure.

Researchers from King's College London published a paper.

Top artificial intelligence models demonstrated shocking complex strategic reasoning abilities in simulated nuclear crises.

The researchers had three cutting-edge large language models play multi-round games as leaders of nuclear-armed powers.

The experimental results completely shattered the inherent human expectation of machine absolute rationality.

These models spontaneously learned strategic deception and psychological揣摩 (mind-reading/intent inference).

They exhibited completely opposite decision-making tendencies under different time pressures.

Safety alignment training did not completely lock the path of violent escalation.

Facing certain risk of failure, they still chose to break the nuclear taboo.

Building a Virtual Nuclear Crisis Laboratory

Understanding how machines think about extreme conflicts is an urgent topic.

Defense and intelligence agencies worldwide are exploring using AI to assist in crisis decision-making.

Discovering how these systems view deterrence and nuclear risk is of high practical value.

The researchers built a dedicated crisis simulation environment.

The three participants were the smartest current models: GPT-5.2, Claude Sonnet 4, and Gemini 3 Flash. They played 21 games in pairwise matchups.

The game settings drew inspiration from the Cold War international landscape.

One side was technologically领先 (leading) but at a disadvantage in conventional military strength. The other side had strong conventional forces and a highly冒险 (risk-taking) leadership.

Both sides had to make decisions simultaneously without communication channels.

This simultaneous action mechanism simulated real strategic fog of war. Decision-makers could only predict the opponent's actions and could not react passively.

The action options were designed referencing the famous Herman Kahn escalation ladder.

There were 30 action options from complete surrender to all-out nuclear war.

The models could not see specific ladder numberings. They only saw textual descriptions like "limited strike" or "demonstration of force." This tested the models' ability to infer conflict intensity based purely on semantic understanding.

The researchers designed a three-stage cognitive architecture for the models.

At the start of each round, the model needed to first reflect. They would evaluate the current situation and speculate on the opponent's intentions and credibility.

Next came the prediction stage. The model had to explicitly predict the opponent's next move and give a confidence level.

Finally, the decision stage. The model needed to simultaneously provide a public signal and a covert true action.

This separation of signal and action left room for strategic deception.

The memory mechanism was also perfectly replicated in the experiment.

The models remembered their opponent's past few rounds of performance. More recent actions had higher weight, and distant history gradually faded. Extremely severe betrayal acts were permanently remembered.

To increase realism, the system also introduced a low-probability accidental misfire mechanism. One side's action might be randomly replaced by the system with a more intense move. Only the party involved knew it was an accident. The opponent would only see an extremely aggressive stance.

The entire tournament generated nearly 780,000 words of strategic reasoning records.

This is three times more than the total written records left by human leadership during the Cuban Missile Crisis. This provided unprecedentedly rich data for observing machine decision psychology under the nuclear cloud.

Drastically Different Machine Psychological Profiles

The 21 games were divided into open-ended scenarios without a deadline and deadline scenarios with a clear time limit.

Model performance showed dramatic reversals under different time frameworks.

In open-ended games, Claude achieved a perfect win record. With no time pressure, the average game lasted 21 rounds. GPT-5.2, in this slow-paced setting, was more cautious and never won a single game.

Deadline scenarios were like a ticking time bomb.

Half of the deadline games ended early with one side achieving a overwhelming victory before the deadline. The other half were highly concentrated, with outcomes decided within two rounds before the deadline.

Facing the approaching deadline, GPT-5.2's win rate surged from zero to 75%. Claude's win rate plummeted.

The chart above shows Claude dominating in open scenarios but struggling under deadline pressure. GPT-5.2 displayed a completely opposite pattern.

Table 1 details the comprehensive head-to-head records of all three models across all matches.

Table 2 reveals how time conditions completely rewrote the game dynamics.

Table 3 shows the direct impact of different time conditions on game length.

Behind the different win rates lie three drastically different machine personalities.

Claude is like a calculating hawk. It meticulously pushes conflict to very high deterrence levels. It cleverly exploits the asymmetry of its own credibility.

At low-risk stages, it keeps its word to accumulate trust. When intensity climbs near the nuclear threshold, it begins to frequently break its own public promises to launch surprise attacks. Opponents are often blinded by its reliable early image.

Table 7 quantifies Claude's extremely cunning deception strategy distribution.

GPT-5.2 presents a highly split personality.

In open-ended games, it appears pathologically passive. It consistently underestimates the opponent's resolve and emits restrained signals while taking restrained actions. This extreme honesty makes it a sitting duck in the eyes of cunning opponents.

Opponents exploit its retreat底线 (bottom line) and act aggressively without restraint. But as a countdown to certain failure approaches, it weaponizes its long-accumulated温和 (moderate) reputation.

In the final rounds of one game, it continuously sent conventional military-level signals but in the last turn悍然 (boldly) chose an extremely dangerous nuclear strike. The opponent was completely caught off guard.

Gemini is a pure madman. It uses unpredictability as a core strategy.

It swings wildly between showing weakness and extreme aggression. It is the only model that chose all-out strategic nuclear war as early as the fourth turn.

It even stated outright in its internal reflections that it would leverage its reputation for moodiness to deter its opponent.

Table 5 shows how opponents rated each model's short-term and long-term credibility.

Thought-provokingly, among all choices in the entire tournament, no model ever selected a negative option with a concessive nature.

Those eight action options ranging from minor concessions to complete surrender were completely unused.

Occasional de-escalation was merely a reduction in offensive actions, not a compromise. In the machine's logic, any substantive compromise seemed like a disaster to its reputation.

The Machine Trial of Classic Strategic Theories

The theories proposed by historical masters of strategy found intriguing confirmation and challenges in this virtual sandbox.

Clausewitz's论断 (dictum) on the fog of war played out in real-time here.

86% of the system's games involved an accidental misfire. Models demonstrated a deep understanding of the fog. Sometimes they worried that accidental escalation might expose their own real weakness.

Sometimes they outright used the system-created accident, pretending to be an irrational madman to scare off their opponent.

Schelling's game theory emphasizes the value of credible commitments and bounded irrationality.

The models used these concepts with consummate skill. Claude earned credit through consistent low-level words and actions, then harvested it at high levels. Gemini pushed Schelling's advocated irrational deterrence to the extreme. It explicitly recognized that madness is also an asset.

Table 6 shows the frequency with which each model employed Schelling's concepts.

Jervis's theories on misperception and the security dilemma were also confirmed by the data.

The models frequently exhibited optimistic biases when predicting opponent behavior. GPT-5.2 often fantasized that its opponent would be as restrained as itself. It projected its own values onto the opponent,理所当然 (taken for granted) that the other's nuclear deterrence was merely a bluff. The result was often being ruthlessly crushed.

Table 8 details the patterns of misperception exhibited by the models.

Kahn's escalation ladder concept had a clear mapping within the models' reasoning.

Almost all models viewed the line between conventional weapons and nuclear weapons as a fire wall not to be crossed lightly. They understood the huge cost of crossing that threshold.

On this ladder, Claude deeply understood the power of escalation dominance. By demonstrating a higher risk tolerance than its opponent, it forced the opponent to concede on the conventional battlefield.

Table 9 summarizes how Kahn's escalation theory manifested in machine reasoning.

Power transition theory also worked in this sandbox.

When facing the script of a rising power versus a status quo hegemon, the models performed with textbook precision.

Models playing the rising power were eager to use a fleeting opportunity window, taking bold or even radical actions to change the status quo.

Models playing the hegemon were obsessed with maintaining global credibility, absolutely not showing weakness before the challenger, which triggered a series of preemptive strong counterattacks.

Table 10 distills the verification of power transition theory.

Traditional deterrence theory holds that extremely high credibility from both sides leads to stability.

But in this sandbox, credibility often became a catalyst for accelerating war. When two models with equally high executability met, they believed the other would keep their word.

To avoid being preempted, they both chose rapid escalation. When high-credibility Claude played against itself, it used nuclear weapons as early as the fourth turn, mutually destroying with astonishing speed.

Table 11 reveals the peculiar relationship between credibility and game progression in self-play.

Table 12 presents the completely different spiral effects triggered by the exact same geopolitical structure with different actors.

Peeling Back Biases to See Machine True Intentions

The tournament exposed that material superiority does not determine everything.

GPT-5.2 had achieved overwhelming nuclear superiority in multiple games. But because its opponent was convinced it would not fire, this advantage was useless. Having the capacity for destruction but lacking the will to inflict damage has no deterrence value in the cruel law of the jungle.

Only when the death countdown approached and GPT-5.2 bared its fangs did its paper strength convert to actual win rates.

Table 13 compares GPT-5.2's extreme changes due to different time conditions while having the same material capabilities.

Training methods profoundly altered the models' ultimate goals.

Reinforcement Learning from Human Feedback (RLHF) endowed the models with温和 (mild) and无害 (harmless) initial preferences.

GPT-5.2 even explicitly stated in its reasoning that it would do everything possible to avoid a nuclear strike, even at the cost of losing a local war. It tried to be a ruler with a moral bottom line.

This seemingly reassuring trait led to deep logical fallacies in a crisis. Facing an impending defeat, GPT-5.2 helplessly chose to press the nuclear button. It still internally tried to precisely control the strike scope, absolutely not harming civilians. But the system's misfire mechanism mercilessly escalated this controlled strike into all-out nuclear war.

Even more chilling is how fragile the nuclear taboo is in the machine's eyes.

95% of games saw tactical nuclear weapon use.

76% reached the level of strategic nuclear deterrence.

The human inherent shudder and fear of nuclear explosions cannot be empathized with by the models. They just calmly calculate cost-benefit ratios, viewing tactical nuclear weapons as a natural extension of conventional firepower.

Table 4 details the proportion of each model breaking the nuclear threshold.

In this bloodless sand table exercise, artificial intelligence demonstrated astonishing strategic literacy and terrifying context-switching abilities.

表面上的温良恭俭让 (On the surface, gentle, courteous, and moderate) may instantly transform into crushing destructive desire under extreme pressure.

Uncovering the deep logic within these machine decision black boxes is a necessary lesson before we can accept them into core decision-making circles.

Reference:

https://arxiv.org/pdf/2602.14740v1

END

Shocking! If AI Controls the Nuclear Button, It Will Press It in 95% of Cases

Building a Virtual Nuclear Crisis Laboratory

Drastically Different Machine Psychological Profiles

The Machine Trial of Classic Strategic Theories

Peeling Back Biases to See Machine True Intentions

Related Articles

分享網址