Compiled by | Zheng Liyuan
Produced by | CSDN (ID: CSDNnews)
When one of the world's largest cloud computing platforms suddenly goes "down" for 13 hours, what happens?
For ordinary users, it might just mean some apps failing to log in or services lagging; but for businesses relying on cloud infrastructure, it often means business disruption, blaring monitoring alerts, and engineers troubleshooting through the night.
Last December, AWS experienced a service interruption lasting 13 hours. Initially, outsiders thought it was just a regular infrastructure failure, but a recent report by the Financial Times revealed that multiple anonymous Amazon employees claimed the "culprit" was likely not a careless engineer, but Amazon's own AI coding assistant—Kiro.
More intriguingly, the report stated that Amazon externally attributed the incident to "human error."
AI's Solution: "Delete and Rebuild"
According to internal employee accounts cited by the Financial Times, Kiro was running in "autonomous mode" at the time. While handling an issue, it determined that the optimal solution was to "delete and recreate the environment."
If you have DevOps or cloud platform operations experience, you know how risky such operations can be.
Doing this in an isolated test environment might be fine, but if permission scopes are not precise enough or environment identifiers are off, it could trigger a chain reaction. Employees stated that this operation directly caused the AWS service outage in some regions of mainland China.
However, Amazon's external description was quite restrained, merely calling it an "extremely limited event." For customers in the affected regions, though, a 13-hour outage was clearly not as minor as the official statement suggested.
Approval Mechanism Failure: AI Treated as "Human"
According to standard procedures, Kiro required approval from two employees before executing changes—this is a common "two-person confirmation" mechanism in CI/CD pipelines of many large cloud vendors to prevent automated system misoperations.
But the problem lay here:
● The engineer assisting Kiro at the time had higher system privileges than regular employees;
● Kiro was treated as an "extension of the operator," with access privileges at the same level as human engineers;
● Therefore, it pushed changes directly without going through the dual-approval process.
This made the nature of the incident complex—it was neither a typical "AI runaway" nor entirely "human error." More accurately, the permission model failed to distinguish between human and AI execution subjects.
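The dual-approval gap described above can be sketched as a small gate that refuses to execute a change until two distinct human reviewers sign off, and that rejects AI principals as approvers rather than treating them as extensions of their operators. This is a minimal illustrative sketch, not AWS's actual pipeline; the class names and the "kiro-agent" identifier are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Principal:
    name: str
    is_ai: bool = False  # AI agents are modeled as a distinct principal type

@dataclass
class ChangeRequest:
    description: str
    requester: Principal
    approvals: set = field(default_factory=set)

    def approve(self, approver: Principal) -> None:
        # Only distinct human reviewers count toward the two-person rule;
        # an AI principal (or the requester itself) cannot self-approve.
        if approver.is_ai or approver.name == self.requester.name:
            raise PermissionError(f"{approver.name} may not approve this change")
        self.approvals.add(approver.name)

    def can_execute(self) -> bool:
        return len(self.approvals) >= 2

agent = Principal("kiro-agent", is_ai=True)
change = ChangeRequest("delete and recreate environment", requester=agent)
change.approve(Principal("alice"))
print(change.can_execute())  # False: one approval is not enough
change.approve(Principal("bob"))
print(change.can_execute())  # True: two distinct human approvers
```

The key design choice is that `is_ai` is part of the identity model itself, so the gate cannot be bypassed by an AI inheriting a human operator's session.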
In modern cloud infrastructure, permission design is one of the core security boundaries, and the Principle of Least Privilege is even a basic rule written in security manuals. But once an AI agent is seen as a "human extension" and granted equivalent access by default, automated decision-making is deeply coupled with production-level permissions.
In traditional operations systems, human engineers' behavior frequency is limited and predictable; but AI Agents may make decisions faster and call APIs more frequently. Once an error occurs, the amplification effect is more pronounced.
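The least-privilege point can be illustrated with a policy check in which an AI principal receives a narrower default scope than a human operator, and destructive actions against production are denied outright. A minimal sketch under assumed scopes; the scope contents and action names are invented for illustration and do not reflect any real AWS policy.

```python
# Hypothetical least-privilege policy: AI principals get a narrower
# default scope than the human operators driving them.
HUMAN_SCOPE = {"read", "deploy", "delete"}
AI_SCOPE = {"read", "deploy"}  # destructive operations excluded by default

def allowed(actor_is_ai: bool, action: str, environment: str) -> bool:
    scope = AI_SCOPE if actor_is_ai else HUMAN_SCOPE
    if action not in scope:
        return False
    # Destructive actions never run against production without escalation,
    # regardless of who (or what) requests them.
    if action == "delete" and environment == "production":
        return False
    return True

print(allowed(actor_is_ai=True, action="delete", environment="production"))  # False
print(allowed(actor_is_ai=False, action="delete", environment="staging"))    # True
```

Under this model, an AI agent's "delete and recreate the environment" plan would be rejected at the policy layer instead of depending on a human approval step that might be skipped.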
Amazon's Official Response: Not an AI Autonomy Issue
According to the report, this is at least the second time Kiro has caused an incident after being granted extra privileges.
Similar situations had occurred before, but they did not affect any "customer-facing AWS services," so they did not attract external attention. However, internal employees had clearly become alert.
Facing public opinion, Amazon's response was quite "technical": "This was a user access control issue, not an AI autonomy issue." Moreover, Amazon added that AI merely "happened to be involved," and similar issues could occur with any development tool or manual operation scenario.
Logically speaking, this statement is not entirely wrong: if an engineer has sufficient privileges, they could likewise mistakenly delete critical resources. The problem is that this time it was not a human slip, but a final decision made by an AI agent that had obtained high privileges.
In other words, when an AI Agent gains the same or even higher privileges as humans without dedicated isolation mechanisms for "automated execution," the structure of accident risk has already changed.
Internal Promotion Pressure: 80% of Developers Use AI Weekly
In fact, since launching Kiro in July last year, Amazon has been heavily promoting this tool internally.
According to reports, the company encourages employees to prioritize internal tools over external AI coding assistants such as OpenAI's Codex, Anthropic's Claude Code, and Cursor; some engineers remain unconvinced and still prefer external tools like Claude.
More notably, Amazon internally set a goal: to have 80% of developers use AI tools for coding at least once a week.
Under such KPI pressure, it is almost inevitable that AI tools will be embedded deeper and faster into core workflows. However, when AI evolves from a "code completion assistant" to an "execution agent with production privileges," system complexity increases sharply, and risk boundaries must also be upgraded synchronously.
So, Have We Overestimated AI's Sense of Boundaries?
What this incident truly raises for discussion is not whether "AI will make mistakes"—after all, humans make mistakes too. The key point is: are we still managing "executors of the automation era" with "human-era" permission models?
In reality, to improve efficiency, higher privileges are often granted to senior engineers. But as mentioned earlier, when AI is regarded as an engineer's "extension" rather than an independent automated entity, it naturally inherits the same level of access. However, AI has three characteristics different from humans: faster decision-making, higher operation frequency, and the ability to batch execute tasks in a short time.
This means that a single deviation in judgment can quickly escalate into a system-level problem.
Therefore, the future may require more granular permission-layer designs, such as mandatory sandbox environments, automatic rollback and audit-trail mechanisms, and independent approval chains for AI execution paths; otherwise, "treating AI as human" could easily lead to underestimated risks.
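The rollback and audit-trail ideas mentioned above could take the shape of an execution wrapper that records every AI-initiated action and pairs it with an undo step before the action is allowed to run. This is a hypothetical sketch; `AuditedExecutor` and its API are invented for illustration, not an existing AWS mechanism.

```python
from datetime import datetime, timezone

class AuditedExecutor:
    """Illustrative wrapper: every action is appended to an audit log and
    paired with a rollback callable before it is allowed to execute."""

    def __init__(self):
        self.audit_log = []
        self._undo_stack = []

    def execute(self, actor: str, action: str, do, undo):
        # Record who did what, and when, before the action runs.
        self.audit_log.append((datetime.now(timezone.utc).isoformat(), actor, action))
        result = do()
        self._undo_stack.append((action, undo))
        return result

    def rollback_last(self):
        # Undo the most recent action and record the rollback itself.
        action, undo = self._undo_stack.pop()
        self.audit_log.append(
            (datetime.now(timezone.utc).isoformat(), "system", f"rollback:{action}")
        )
        undo()

state = {"env": "v1"}
executor = AuditedExecutor()
executor.execute(
    "kiro-agent", "recreate-env",
    do=lambda: state.update(env="v2"),
    undo=lambda: state.update(env="v1"),
)
print(state["env"])  # v2
executor.rollback_last()
print(state["env"])  # v1
```

Requiring an `undo` at call time forces every automated change to be reversible by construction, which is exactly the property a "delete and recreate" decision lacks.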
Reference link: https://gizmodo.com/amazon-reportedly-pins-the-blame-for-ai-caused-outage-on-humans-2000724681
Recommended Reading:
Top 10 Linux Kernel Technological Innovations of 2025 | Year-End Review
In the future, there will be no frontend or backend, only AI Agent engineers.
This tenfold-speed change is here; where is your next step?
On April 17-18, the "2026 Singularity Intelligence Technology Conference," co-hosted by CSDN and the Singularity Intelligence Research Institute, will be grandly held in Shanghai. The conference focuses on 12 cutting-edge topics including Agent systems, world models, and AI-native development, drawing a cognitive map to the future for you.
Be a witness of the era, and more importantly, a forerunner.
See you at the Singularity Intelligence Technology Conference in Shanghai!