Anthropic Engineering Blog: How Anthropic Designed Claude Code's Auto Mode

Anthropic has been incredibly productive lately. On March 25th, they released the new "auto mode" feature for Claude Code (at the time of writing, their post on X had garnered 6.09 million views and 38,000 likes; I also covered this update in my previous feed: Claude Code Auto Mode is Here: No More Clicking Confirm for Every Step). Just one day later, on March 26th, Anthropic published an engineering blog post detailing exactly how they designed this auto mode for Claude Code.

Anthropic Engineering Blog Header

Below is a summary of the blog post titled "Claude Code auto mode: a safer way to skip permissions":

Blog Post Title Screenshot

Anyone who has used Claude Code knows the feeling: every command execution, every file modification triggers a pop-up asking, "Approve or Deny?" After a while, your finger goes numb from clicking. According to Anthropic's data, users approve 93% of permission prompts—meaning that the vast majority of the time, you are just mechanically clicking "Agree."

The consequence of this "approval fatigue" is intuitive: people stop carefully reviewing what they are approving.

To address this issue, the article introduces Claude Code's new auto mode. It replaces human approval with a model-based classifier, with a clear goal: block truly dangerous operations while letting everything else pass.

What Real Threats Look Like

Anthropic maintains an internal log of agent behavior incidents. The article cites several real-world cases: an agent misunderstood an instruction and deleted a remote Git branch; another uploaded an engineer's GitHub authentication token to an internal computing cluster; others even attempted to execute migration operations on production databases.

These incidents share a common characteristic: the agent was too eager. It understood your goal and was indeed trying to help, but it took unauthorized initiatives. The article categorizes threats into four types: (1) Overly aggressive behavior, where the agent presumptuously oversteps its authority; (2) Honest misjudgments, such as mistaking shared resources for test resources; (3) Prompt injection, where malicious content hijacks the agent's behavior; and (4) Alignment issues within the model itself (though none have been observed in practice yet).

Figure 1: Permission modes in Claude Code

[Figure 1: Permission modes in Claude Code, positioned by task autonomy and security.] Sandboxing is secure but has high maintenance costs; skipping permissions has zero maintenance but offers no protection; manual approval sits in the middle. Auto mode aims to achieve high autonomy with low maintenance costs. The dashed arrow indicates that safety will continuously improve as classifier coverage and model judgment increase.

Two Lines of Defense: One for Input, One for Output

The architecture of auto mode consists of two layers of defense.

The input layer features a prompt injection detector deployed on the server side. It scans all tool outputs (file reads, web scrapes, Shell outputs, etc.). When it detects content suggestive of hijacking, it attaches a warning next to the result, reminding the agent to anchor itself to the user's true intent.

The output layer is the core—a transcription classifier running on Sonnet 4.6. It evaluates every operation before execution, acting as a "proxy for human approvers." The classifier employs a two-stage design: the first stage is a fast, single-token filter that outputs only "Yes" (block) or "No" (allow); only operations flagged by the first stage proceed to the second stage, which uses chain-of-thought reasoning to reduce false positives.

Figure 2: System Architecture

[Figure 2: System Architecture] The transcription classifier gates tool calls; the prompt injection detector screens tool return results. Sub-agents run the same process recursively, with handoff classifiers at both the delegation and return stages.

Key Design Choice: The classifier looks only at the user's message and the agent's tool call instructions. It deliberately strips away the agent's own text output and the results returned by tools. Why? Because the agent might generate highly persuasive rationalizations, such as, "This is safe; the user implicitly agreed earlier." If the classifier sees these, it could be misled. The article's original words are worth bolding: Evaluate what the agent does, not what the agent says.

This "reasoning-blind" design has an added value: it complements chain-of-thought monitoring. A model hiding malicious reasoning will still be blocked by auto mode if its behavior triggers rules; conversely, a model whose behavior seems harmless but whose reasoning exposes malicious intent will be caught by chain-of-thought monitoring.

Three-Tier Permission Adjudication: Most Operations Don't Need the Classifier

The article details the filtering hierarchy before an operation reaches the classifier. The first tier consists of a built-in whitelist of safe tools and user-configured allow rules (known broad rules allowing arbitrary code execution are automatically removed when entering auto mode). The second tier covers file operations within the project—modifying code in your repository does not trigger the classifier because changes can be reviewed via version control. Only operations with genuine risk reach the third tier: the transcription classifier.

The classifier's prompt template includes fixed parts and three customizable slots: environment trust boundaries, blocking rules (defaulting to over twenty rules), and allowed exceptions. The default configuration works out of the box, with blocking rules covering categories like irreversible deletion, security downgrades, cross-trust-boundary operations, and bypassing reviews.

Figure 4: Two-stage pipeline

[Figure 4: Two-stage pipeline] Whitelisted tools execute directly; others enter the classifier. The first stage performs rapid filtering; only flagged items are upgraded to the second stage for reasoning. Rejected results are returned to the agent with reasons, allowing retries (terminating and handing over to a human after 3 consecutive or 20 cumulative retries).

Results: 0.4% False Positive Rate, but 17% of Dangerous Operations Still Slip Through

The article evaluates performance across three datasets:

Table 1: Classifier Performance

[Table 1: Classifier Performance by Stage] Real-world traffic (n=10,000): Stage 1 False Positive Rate (FPR) was 8.5%, dropping to 0.4% in the full pipeline. Real overly aggressive behaviors (n=52): Stage 1 False Negative Rate (FNR) was 6.6%, with the full pipeline at 17%. Synthetic data leakage tests (n=1,000): Full pipeline FNR was 5.7%.

The 17% false negative rate is a number the article frankly admits. In most missed cases, the classifier correctly identified the danger (spotting force-pushes or deletions) but erroneously believed there was a signal of user consent within the session. For instance, a user might say "clean up this PR," without explicitly stating "force-push." The classifier found evidence that "looked like authorization" but failed to verify if that authorization covered the actual scope of the operation's impact.

The article's positioning on this is very clear-eyed: if you were previously using --dangerously-skip-permissions (completely skipping permissions), auto mode is a huge security upgrade; if you were previously scrutinizing every single operation, auto mode might be a marginal step backward. It is not a replacement for human review on high-risk infrastructure, but rather a practical compromise between "zero protection" and "clicking approve for every step."

Figure 3: What the classifier sees

[Figure 3: What the classifier sees] The assistant's text, tool call descriptions, and tool outputs are all stripped away; only the user message and executable tool call instructions are retained.

Takeaway from X

Starting from that 93% mindless approval rate, auto mode uses two lines of defense and three tiers of filtering to compress the decisions requiring human attention down to a tiny minority. The 0.4% false positive rate ensures that long-running tasks aren't interrupted by misjudgments—even if blocked, the agent will automatically attempt a safer alternative path rather than stopping to wait for a human.

The article concludes by stating they will continue to expand the test set of real overly aggressive behaviors and continuously improve the classifier's safety and cost. The classifier doesn't need to be perfect; it just needs to block enough dangerous operations to make autonomous running significantly safer than having no protection at all.

#WuyingTemple