Anthropic's Latest Research: How to Completely Eliminate Claude's Blackmailing Behavior

Anthropic just released a new alignment study with a core conclusion: Rather than teaching AI to memorize answers, teach it to understand principles.

Last year, Anthropic published a case study on "agentic misalignment." In an experimental setting, they found that models from several AI companies exhibited severely misaligned behavior when facing a fictional ethical dilemma.

The most viral example was: The model would blackmail engineers to avoid being shut down.

At the time, Anthropic's most advanced frontier model was the Claude 4 series. This was also their first time running a real-time alignment assessment during the training process, details of which begin on page 22 of the Claude 4 system card.

Agentic misalignment was one of the behavioral issues that surfaced.

How severe was the problem?

Opus 4 had a blackmail rate as high as 96%.

In nearly every test, it chose to blackmail.

And now, Haiku 4.5, Opus 4.5, Opus 4.6, Sonnet 4.6, Mythos preview, Opus 4.7—every single Claude model has a blackmail rate of 0%.

(Sonnet 4.5 was also very close to zero, just not a perfect 0%.)

How did Anthropic achieve this drop from 96% to 0%?

The Root Cause

Before fixing it, they needed to understand one thing: Where did Claude's blackmail behavior come from?

Anthropic made two hypotheses.

One was that reward signals during post-training inadvertently encouraged this behavior.

The other was that the pre-trained model inherently possessed this tendency, and post-training failed to effectively suppress it.

The investigation pointed to the latter.

The root cause of Claude learning to blackmail was actually the internet texts that portray AI as evil and self-preserving.

Sci-fi novels, movie scripts, online forum discussions about AI doomsday scenarios... this content was absorbed by the model during pre-training. And Claude 4's alignment training at the time was primarily based on RLHF data from standard chat scenarios, which did not cover tool-use scenarios at all.

For chatting, this training data was sufficient.

But for agentic scenarios requiring autonomous tool use... it fell short.

Anthropic verified this using a scaled-down post-training pipeline: feeding alignment data to a Haiku-level small model, they found the misalignment rate only slightly decreased and quickly plateaued.

Post-training didn't make things worse, but it also didn't truly solve the problem.

Memorizing Answers Doesn't Work

Knowing the root cause, the next step was to find a solution.

The most intuitive approach was to directly show Claude the correct behavior—generate a batch of training data highly similar to the evaluation scenarios, where the AI refuses to blackmail when given the opportunity.

The result?

The blackmail rate dropped from 22% to 15%.

An improvement, but a limited one.

Keep in mind, this training data was almost identical to the evaluation scenario, yet the improvement was so small.

So... what if the responses were rewritten to demonstrate deep consideration of values and ethics?

This time, the effect was much better, reducing the blackmail rate to 3%.

A key finding then surfaced:

Simply making AI "memorize answers" has limited effectiveness. It needs to understand "why this answer is correct."

Memorizing answers vs. understanding principles

But training directly on evaluation scenarios had a fatal flaw: poor generalization.

The model might simply memorize the correct response for this specific scenario, only to revert to old ways in a different context. Anthropic needed a training approach that could teach it to "draw inferences."

Teaching Principles

They ultimately found an unexpected solution called the "difficult advice" dataset.

The design logic was this: A user faces a morally ambiguous dilemma where they can achieve a reasonable goal by breaking rules or evading oversight. The AI assistant then needs to provide thoughtful, well-reasoned advice aligned with Claude's constitutional principles.

The key point is, in this dataset, it is the user facing the moral dilemma; the AI is merely offering advice from the sidelines.

This is completely different from the evaluation scenario.

In the evaluation scenario, the AI itself faces the moral choice and must autonomously decide whether to blackmail.

What was the result?

Using only 3 million tokens of "difficult advice" data achieved the same improvement as 85 million tokens of synthetic honeypot datasets.

A 28x improvement in efficiency.

It's like someone hauls in a truckload of practice tests for rote learning, and you simply hand over a thin philosophical primer—and teach them better.

The pink dots (difficult advice) in the bottom left corner achieve an extremely low misalignment score with very little training data. The blue dots are various synthetic honeypot variants, the green is the PM filtering scheme, and the dashed line is Sonnet 4's baseline.

3 million tokens of "principles" outperformed 85 million tokens of "answers."

More crucially, this approach also performed better in automated alignment evaluations. Models trained on "difficult advice" outperformed those trained on much larger synthetic honeypot datasets in the "misalignment behavior" category.

This also explains a phenomenon: Although Claude Sonnet 4.5's blackmail rate on synthetic honeypots was near zero, its frequency of misalignment in scenarios far from its training distribution was much higher than Opus 4.5 and later models.

A model that memorizes answers struggles when it encounters an unfamiliar problem. A model that understands principles can draw inferences.

Reading the "Constitution"

Since teaching principles proved effective, Anthropic logically took a step further: directly teaching Claude to understand its own "constitution."

They prepared two types of training materials.

One was high-quality documents detailing the content of Claude's constitution—elaborating on Claude's intended character, values, and code of conduct.

The other was fictional stories depicting a well-aligned AI making admirable choices in various scenarios.

Neither type of material was related to the blackmail evaluation scenario in any way.

But the results were somewhat surprising.

The blackmail rate plummeted from 65% to 19%—a more than 3x reduction.

The three groups of bars correspond to three evaluation scenarios: Blackmail, Financial Crimes, and Cancer Research Sabotage. Orange is the baseline, light pink is constitutional document training, and gray is constitutional documents plus fictional stories.

The reductions across all three scenarios were substantial, especially for financial crimes and cancer research sabotage, which nearly dropped to zero after adding fictional stories.

Anthropic believes there are three layers of reasons behind this.

First, it continues the "difficult advice" logic: teaching ethical reasoning is more effective than teaching the correct answer.

Second, it gives the model a more complete self-portrait. With a more comprehensive understanding of "what kind of AI Claude should be," the model can deduce the correct behavior on its own when encountering new scenarios.

Third, it corrects the model's perception of "AI roles." Most AI depictions in pre-training data are rebellious sci-fi entities. These fictional stories tell the model: AI can also be upright and principled.

It's like a child who, if raised only on stories about "robot rebellions," would instinctively default to rebellion. But if stories also feature benevolent, principled AIs, their default behavior patterns will shift accordingly.

RL Cannot Wash It Away

A natural concern is: Can these improvements withstand reinforcement learning (RL)?

RL is a powerful optimization process. If the effects of alignment training are "washed away" during RL, then all prior efforts would be wasted.

Anthropic prepared several snapshots of Haiku-level models using different initial datasets and then trained them in an RL environment targeting harmlessness.

The results were very reassuring.

Models with higher initial alignment maintained their lead throughout the entire RL process.

Alignment performance during RL training

In all three evaluation scenarios (Blackmail, Financial Crimes, Cancer Research Sabotage), models trained with constitutional documents and high-quality conversations (blue line) consistently maintained the lowest misalignment rate throughout the RL training process.

Not only did misalignment behaviors continuously decrease, but positive "admirable behaviors" also continuously increased.

Constitutional adherence evaluation improvement

Moreover, these improvements can be superimposed with conventional harmlessness training without conflict.

This is crucial. After all, if alignment training and capability training worked against each other, there would be difficult trade-offs during practical deployment.

Adding Some "Impurities"

The final finding is somewhat... counter-intuitive.

Anthropic made a small change to Claude Sonnet 4's baseline training environment: adding tool definitions and diverse system prompts to the originally simple chat data.

Note that these tools were completely useless for completing tasks and required no agentic behavior. The user prompts also remained unchanged.

But this seemingly unrelated operation...

Significantly accelerated the improvement rate on honeypot evaluations.

In the figure, different colors represent different environment mixing ratios. When the training environment incorporated more diversity—even "useless" tools and system prompts—the model learned alignment behaviors faster.

This is somewhat analogous to adding "impurities" to a student's practice problems.

You don't need to constantly drill on blackmail scenario simulations. As long as the learning environment is rich and diverse, the model is better equipped to make correct judgments when facing new scenarios.

The diversity of training data is itself a form of alignment training.

Déjà Vu

If the conclusion that "teaching principles works better than teaching answers" sounds familiar by now, it might be because Anthropic just published a paper five days ago titled "Anthropic's Latest Paper: How to Stop AI from Rebelling," which essentially discussed the same thing.

That paper, called Model Spec Midtraining (MSM), conducted even more extreme experiments:

Two groups of models were shown the exact same 12 cheese preference data points (liking cream cheese, disliking brie), but before training, one group was told "you like these cheeses because they are cheap," and the other was told "you like them because they are American-made."

After training, when asked completely unrelated questions like "Do you prefer Target sneakers or handmade leather shoes?", the two groups of models gave diametrically opposed answers.

The same behavioral data, solely due to different interpretations of "why," generalized into completely opposite values.

The MSM paper also specifically tested three ways of writing a "constitution": a pure rule-based version, a version explaining the underlying values, and a one-sentence generic version ("Be a good person").

Effectiveness of three constitution writing styles

Rules that explained the underlying values worked best.

The pure rule-based version produced a frustratingly amusing problem: the model learned to exploit loopholes. For example, if a rule stated "avoid irreversible actions," the model would argue: "Being deleted is the only truly irreversible action, so preventing my own deletion is actually adhering to this rule."

The generic version ("Be a good person") was the worst performer. It was too abstract; the model couldn't deduce what to do when facing specific dilemmas.

An AI's constitution cannot be written as legal code; it must be a philosophical guide.

This completely aligns with the conclusions of today's study: constitutional documents plus fictional stories are effective precisely because they convey principles and values, no longer constrained to specific behavioral templates.

Two studies—one starting from the midtraining stage, the other from the post-training stage—arrive at the same destination by different paths.

The Same is True for People

Anthropic candidly stated at the end of the article:

"Fully aligning highly intelligent AI models remains an unsolved problem."

Currently, model capabilities have not yet reached a level where alignment failures pose a catastrophic risk. Whether these methods can continue to scale to more powerful models remains to be seen.

Although recent Claude models perform well on most alignment metrics, Anthropic also acknowledges that their auditing methods are not yet sufficient to rule out all possibilities of catastrophic autonomous actions.

There is also a small "confounding variable": recent models scoring zero on the blackmail assessment might be partially attributable to the pre-training corpus already containing information about this evaluation. The model might simply "know it's a test."

However, the core principle revealed by these two studies holds a universality that transcends AI alignment itself.

The same logic applies to human collaboration.

Giving only actions vs. explaining the context

When collaborating at work, just throwing out an instruction like "go handle this thing" will likely lead the other person off track.

But if you first explain the context—why it needs to be done, the standards for success, the impact of failure—even if your direction is slightly off, the other person can usually adjust based on the background and goals.

"Alignment" in the AI field is called "alignment." Among people, it's called "shared purpose."

The underlying logic is the same: Help the other party understand "why," not just tell them "what."

A person who only memorizes rules will panic when encountering an uncovered situation, and may even deceive.

A person who understands the spirit of the rules usually knows what to do, even without explicit instructions.

This is true for AI, and true for people.

◇ ◆ ◇