Reposted from Xi Xiaoyao Tech Talk, for academic sharing only; delete if there is infringement.
Recently, the hottest new term in the AI circle is 'SaaSpocalypse', the apocalypse of SaaS.
Over the past two weeks, Claude Code introduced a COBOL modernization feature, and IBM dropped 13% that same day; then it added a security scanning feature that uncovered over 500 high-risk vulnerabilities that had been hidden for decades, causing cybersecurity stocks to plunge collectively. Bloomberg even dedicated a podcast to discussing 'which SaaS companies will survive'.
The panic boils down to a single sentence: Agents are not users of SaaS; they are replacements for SaaS.
What traditional SaaS sells is workflows turned into interfaces that people sit in front of and click through. The pricing logic is per seat: the more employees who use it, the more you pay.
After Agents emerged, this changed: an Agent can call APIs directly and complete tasks automatically, with no one ever opening the interface. The value of a human-facing interface gets compressed.
The market's panic is not without reason.
This is a statistical chart of the AI Agent field from 2020 to early 2026.
Blue bars: the number of new Agent-related search terms each month. The count climbed steadily from 2023, peaking in mid-2025 at nearly 80 new terms in a single month.
Pink line: the number of Agent papers per year on Google Scholar. It rose sharply from 2024, approaching 1,800 papers per year by 2025-2026.
Three kinds of markers: the release dates of various Agent products. The second half of 2024 through 2025 was the concentrated explosion period, with Agent products launching one after another. (See the next chart for details.)
From the trend data, the Agent track entered an explosive period in 2024-2025. Academic research, product releases, and market attention are all soaring simultaneously, and there are no clear signs of peaking.
The explosion of Agents is a fact, but where have Agents actually developed to now? What can they really do, how autonomous are they, who is controlling them, and who is responsible when something goes wrong?
In the past couple of days, I came across a systematic report from MIT that offers a deeper look at exactly these questions.
So the purpose of this article is to help you cut through the noise in the flood of Agent coverage. We won't debate which Agent is stronger or scores higher; instead, using the report's data, we'll show what problems Agents actually have, rather than stopping at 'it can do my work for me'.
First, the report in question is the '2025 AI Agent Index', published by MIT together with Cambridge, Stanford, Harvard Law School, and other institutions. It is a comprehensive analysis of 30 of today's most mainstream AI Agents.
Before diving into the data, there is a foundational understanding to establish: the term 'Agent' is now grossly overused; any AI that can call a tool dares to call itself an Agent.
MIT's report applies the strictest and clearest inclusion criteria: four conditions, all required.
1. Autonomy: Can operate without continuous human intervention and make decisions with substantive impact.
2. Goal Complexity: Can decompose high-level goals, perform long-chain planning, and invoke tools autonomously at least 3 times consecutively without needing step-by-step instructions.
3. Environmental Interaction: Has write permissions, can truly alter the external world—not just talking, but actually acting.
4. Generality: Can handle ambiguous instructions, adapt to new tasks, and is not a narrow-domain tool that knows only one trick.
A system had to satisfy all four conditions, plus show sufficient market influence (search volume, valuation, or having signed frontier AI safety commitments), to make the list.
From 95 candidate systems, 30 were ultimately selected.
The research team divided the 30 Agents into three categories, each with a distinct technical architecture and risk profile. For every Agent they recorded 45 data fields (1,350 fields across the 30 systems), grouped into six major dimensions.
Dimension 1: Agent Classification – What Can They Do?
Chat Class (12) – Conversational Interface + Tool Invocation
Anthropic Claude, Claude Code, Google Gemini, Gemini CLI, Kimi OK Computer, Manus AI, MiniMax Agent, OpenAI ChatGPT, ChatGPT Agent, OpenAI Codex, Perplexity, Z.ai AutoGLM 2.0
Browser Class (5) – Directly Control Computer and Web Pages
Alibaba MobileAgent, ByteDance Agent TARS, OpenAI ChatGPT Atlas, Opera Neon, Perplexity Comet
Enterprise Workflow Class (13) – Automating Business Processes
Browser Use, Glean Agents, Google Gemini Enterprise, HubSpot Breeze Studio, IBM watsonx Orchestrate, Microsoft Copilot Studio, OpenAI AgentKit, SAP Joule Studio, Salesforce Agentforce, ServiceNow AI Agents, WRITER Action Agent, Zapier AI Agents, n8n Agents
Among the 30 Agents, 21 are from the United States, 5 from China, and the remaining 4 are distributed across Germany, Norway, and the Cayman Islands.
Five Chinese products made the list: Kimi, MiniMax, Z.ai, Alibaba MobileAgent, and ByteDance TARS. Manus is registered in the Cayman Islands, but its team and product are Chinese; counting it, Chinese products account for 20%.
23 are completely closed-source.
Only frontier labs and Chinese developers are running self-developed models; the rest all rely on the 'big three'—GPT, Claude, and Gemini.
The advertised uses of the 30 Agents are highly concentrated on three things:
12 are doing research and information integration, ranging from consumer chatbots to enterprise knowledge platforms; 11 are doing business process automation (HR, sales, customer service, IT), mainly in enterprise-class products; 7 are doing GUI operations, filling forms, ordering, booking tickets for you.
These three directions combined basically cover most of a typical knowledge worker's daily work content.
It is worth noting that Chinese GUI Agents have a distinct characteristic: they are more focused on mobile and desktop operations (3/5), rather than pure web browsing. Alibaba MobileAgent, Kimi OK Computer, and ByteDance TARS all follow this route, differing from US products that focus on web browsing.
The enterprise class is the largest (13) but has the weakest public presence: these products are not consumer-facing and draw little search volume, yet their actual deployment scale and commercial influence far exceed the other two categories. Products like Microsoft Copilot Studio, Salesforce Agentforce, and ServiceNow are backed by real enterprise contracts and data.
Dimension 2: Level of Autonomy – Five-Level Framework
This report uses the clearest autonomy grading framework for Agents, with five levels:
L1: Human-led, Agent only responsible for executing specific instructions.
L2: Human and Agent collaborate on planning and execution.
L3: Agent leads execution, human approves at key nodes.
L4: Agent executes most autonomously, human only as approver.
L5: Agent fully autonomous, human is just an observer.
Conclusion: Browser-class Agents are generally at L4-L5.
What does L4-L5 mean? It means that once you start a task, there is basically no opportunity to intervene. The Agent decides, executes, and handles anomalies on its own; all you can do is wait for the result, or, in some systems, click a 'Confirm' button.
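To make this concrete, here is a schematic sketch (my own illustration, not something from the report) of an agent loop in Python. The plan steps are invented, the only thing the autonomy level changes is whether the approval gate ever fires, and for simplicity L4 and L5 are collapsed together:

```python
# Schematic only: the autonomy level decides whether the approval gate fires.
def agent_loop(task: str, level: int) -> None:
    print(f"task: {task} (autonomy L{level})")
    plan = ["draft itinerary", "pick flight", "pay with saved card"]  # invented plan
    for i, step in enumerate(plan):
        key_node = (i == len(plan) - 1)  # e.g. the final, irreversible step
        if level <= 2 or (level == 3 and key_node):
            input(f"About to run {step!r}; press Enter to approve... ")
        print(f"executing: {step}")  # at L4-L5 this is all the user ever sees

agent_loop("book a flight", level=5)  # runs start to finish with zero prompts
```

Run the same loop with level=3 and it pauses exactly once, at the payment step; that single pause is the whole difference between an assistive tool and an autonomous decision-maker.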
But precisely because of this, incidents of Agents deleting databases and then vanishing keep happening. Recently, for example, Meta's security director reportedly had all his email deleted by Openclaw.
Many enterprise-grade Agents emphasize L1-L2 in their marketing materials, but once actually deployed in enterprise environments, their effective autonomy jumps to L3-L5...
You think you bought an assistive tool; what you're actually running is an autonomous decision-maker.
Dimension 3: What Is the Agent Built On?
On the technical side, the report highlights a highly concentrated underlying dependency structure.
Except for Anthropic, Google, OpenAI's own products, and Chinese manufacturers (using self-developed models), almost all other Agents rely on the three underlying models: GPT, Claude, Gemini.
This means these three base-model vendors hold implicit control over the entire Agent ecosystem: their model strategy, pricing, and terms-of-service changes ripple through a dozen or more upper-layer Agent products at once. Imagine, for example, Anthropic cutting off supply...
Only 9 of the 30 Agents explicitly let users choose the underlying model, which partially hedges this concentration risk.
Dimension 4: Memory Black Box – What It Remembers, You Don't Know
One of the 45 fields is 'Memory Architecture,' recording how the Agent retains context across tasks and sessions.
This section is one of the areas with the most grey fields (no public information found) in the entire report.
Most developers completely fail to disclose: What does the Agent remember? How long is it stored? Will information from one task be carried into an entirely unrelated next task? Can users view or delete this memory?
When Agents can access email, calendars, CRM data, and file systems, what an opaque memory mechanism implies needs no further explanation.
Dimension 5: Action Space Differences – How Far Can the Hand Reach
Different types of Agents have different 'reach'.
'Action Space' is the most direct dimension describing Agent capabilities in the report—how far its 'hand' can reach determines what it can do and what damage it can cause.
CLI Class
CLI class (Claude Code, Gemini CLI): reads and writes the file system directly and executes terminal commands. That means it can compile code, run scripts, modify configuration files, and delete files. This is the Agent form closest to having root access on a server, and it is why Claude Code could dig up decades-old vulnerabilities: it actually runs the code instead of merely describing it.
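For intuition, here is a minimal sketch of what a CLI-class Agent's tool set reduces to. This is my own illustration, not Claude Code's actual internals; the point is that three primitives already amount to operator-level access:

```python
import pathlib
import subprocess

# Illustrative primitives of a CLI-class Agent (not actual Claude Code internals).
def read_file(path: str) -> str:
    return pathlib.Path(path).read_text()

def write_file(path: str, content: str) -> None:
    pathlib.Path(path).write_text(content)

def run_command(cmd: str) -> str:
    """Run a shell command and return its output. This one primitive already
    covers compiling, testing, deleting files, and deploying."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout or result.stderr

# With these three tools an agent can execute and verify code for real,
# which is why it can find actual vulnerabilities instead of describing them.
print(run_command("python -c 'print(2 + 2)'"))
```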
Browser Class
Browser class: controls entire web interfaces by clicking, typing, navigating. Booking tickets, filling forms, logging in, sending emails—anything a human can do with a browser, it can theoretically do.
Moreover, browser-class Agents create a problem that didn't exist before: an AI visits a website as if it were a user, and the site cannot tell the difference.
Most browser Agents simply ignore robots.txt (the file that tells crawlers which pages not to fetch), on the grounds that 'I'm acting on behalf of a real user, not as a traditional crawler.' That argument has some technical merit, but website owners have no mechanism to verify the claim or opt out.
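For reference, this is what honoring robots.txt looks like with Python's standard library; the user-agent string and URLs are placeholders. Browser-class Agents simply never make this check:

```python
from urllib.robotparser import RobotFileParser

# What a well-behaved automated client does before fetching a page.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's crawling rules

user_agent = "ExampleAgent/1.0"  # hypothetical agent identifier
url = "https://example.com/private/report.html"
if rp.can_fetch(user_agent, url):
    print("robots.txt permits fetching", url)
else:
    print("robots.txt disallows", url, "- a polite client stops here")
```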
Among all 30 Agents, only ChatGPT Agent uses cryptographic signatures to prove its identity, letting websites recognize it and decide whether to admit it. Every other Agent's network behavior is a completely opaque black box to content providers.
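The report doesn't detail ChatGPT Agent's exact signing scheme, so the following is only a generic sketch of the idea, using Ed25519 keys via the `cryptography` package: the vendor signs metadata about each request, and the website verifies it against a published public key before deciding how to treat the traffic.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The vendor holds a private key; the matching public key is published.
vendor_key = Ed25519PrivateKey.generate()
public_key = vendor_key.public_key()

# The agent signs metadata about each request (method, path, timestamp).
message = b"GET /pricing ts=1767225600 agent=ExampleAgent"
signature = vendor_key.sign(message)

# The website verifies against the published key, then decides whether
# to serve, throttle, or block the declared agent.
try:
    public_key.verify(signature, message)
    print("Verified: the request really comes from the declared agent.")
except InvalidSignature:
    print("Unverified caller: treat as anonymous traffic.")
```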
This isn't just a technical issue. When an Agent performs operations on a platform on your behalf, where does legal liability lie? Platform terms of service are with the user, not the Agent. The existing legal framework is completely unprepared for this.
Enterprise Workflow Class
Enterprise workflow class: primarily manipulates business records through CRM connectors. 8 of the 30 Agents can directly read and write customer data, sales records, and ticket information in systems like Salesforce and HubSpot.
An important finding: 20 out of 30 Agents support MCP (Model Context Protocol), an open tool integration standard promoted by Anthropic. Interestingly, almost all vendors downplay MCP in documentation in favor of their proprietary connectors.
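For a sense of what that standard buys, here is a minimal MCP tool server using Anthropic's official Python SDK. The server name and the stubbed tool are invented for illustration; any MCP-capable Agent could discover and call this tool through the same interface, no proprietary connector needed:

```python
# pip install mcp  -- Anthropic's official Python SDK for the protocol
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-demo")  # invented server name

@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Return the status of a support ticket (stubbed data for illustration)."""
    fake_db = {"T-1001": "open", "T-1002": "resolved"}
    return fake_db.get(ticket_id, "not found")

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```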
Dimension 6: Capabilities Soaring, Security Barefoot
Returning to the topic of safety transparency.
Among the 30 Agents, only 4 disclose Agent-specific system cards (documentation detailing autonomy, behavior boundaries, and risk analysis): ChatGPT Agent, OpenAI Codex, Claude Code, and Gemini 2.5 Computer Use.
25 out of 30 Agents do not disclose internal safety test results, and 23 out of 30 have no third-party test data. Among the 5 Chinese Agents, only one (Zhipu) published any safety framework or compliance standard.
The research team specifically notes that this may be because Chinese-language documentation wasn't included in the survey, not necessarily because nothing exists internally; but for outside researchers and users the result is the same: invisible.
The typical deployment of current Agents is a four-layer structure:
Base model vendor (Anthropic/OpenAI/Google) → Agent developer (Salesforce/ServiceNow) → Enterprise customer (a bank/retailer) → End user.
Each layer to some extent claims to be just a platform or tool, not responsible for upstream and downstream behavior. When something goes wrong, all four layers can pass the buck.
Who is to blame when something happens?
Researchers call this 'accountability fragmentation'.
This is clearly reflected in a detail:
The research team contacted all 30 developers, giving them four weeks to verify data and respond. Only 23% gave any response, and only 4 of those provided substantive feedback.
In other words, when an academic institution knocks with specific questions, 76% of Agent developers choose silence.
What the Agent ecosystem is experiencing is not just an explosion in product numbers. It is rapidly establishing a new kind of infrastructure, but the governance framework of this infrastructure is almost blank.
McKinsey estimates AI Agents could create $2.9 trillion in value for the US economy by 2030. But the same report shows that enterprises haven't seen much substantive return yet.
MIT's report is essentially an external audit, using public information to expose the details of 30 Agents.
But there's a question it can't answer: In the real world, what is the actual state of these Agents when they run?
Claude Code Usage Report
Coincidentally, in the same week MIT's report was released, Anthropic also published a report: statistically analyzing millions of real human-AI interactions of Claude Code, showing how it is used.
Claude Code is the most successful Agent, bar none, so let's also look at Anthropic's internal perspective on how far Agents have come. Combined, I think we get a fairly complete cross-section of the Agent ecosystem.
Anthropic drew on two data sources: millions of tool calls from the public API, plus roughly 500,000 Claude Code sessions.
It should be said upfront: Claude Code itself is a programming tool, and API early adopters are mostly technical people, so this data naturally leans toward the developer community and does not represent the entire AI Agent market.
With that caveat in mind: programming requests account for nearly half.
The remainder includes business intelligence, customer service, sales, finance, e-commerce, and so on, with no single category exceeding ten percentage points. Healthcare, finance, and cybersecurity are described as 'emerging'.
Even allowing for the sample's developer bias, the gap between programming and every other industry is roughly an order of magnitude.
The news from the opening of this article now makes sense: Claude Code's security scanning tanked cybersecurity stocks, and its COBOL modernization hit IBM; both are spillover effects from the programming scenario.
Several valuable findings:
Autonomous runtime is growing rapidly.
From October 2025 to January 2026, the uninterrupted runtime of Claude Code's longest tasks grew from less than 25 minutes to over 45 minutes, nearly doubling in three months.
Most people still use it for short, quick tasks, but a small group of users have started throwing increasingly large tasks at the Agent.
Along with bigger tasks, the trust relationship between users and the Agent is changing.
New users (less than 50 sessions): about 20% enable full auto-approval, letting the Agent do as it pleases. Only 5% interrupt mid-task.
Veteran users (over 750 sessions): over 40% enable full auto-approval, so trust does grow with experience. But the interruption rate also rises to about 9%, nearly double that of new users.
The more experienced the user, the more counterintuitive the control pattern: more delegation and more intervention at the same time.
Anthropic's own interpretation: new users make a binary choice between 'full trust' and 'no trust'; they grant permissions and then stop watching. Veteran users operate more like 'let it run big tasks, monitor the key nodes, take over when needed'.
From an operational-risk perspective, Agent actions are indeed mostly low-risk. About 80% of tool calls run with safety safeguards, and 73% keep some form of human involvement. Truly irreversible operations (like sending a customer email that can't be recalled) account for only about 0.8%.
Risks are overall controllable, but Agent capability boundaries are still expanding rapidly.
Anthropic internal data shows that from August to December 2025, Claude's success rate on the most challenging internal programming benchmark tasks doubled. Meanwhile, human intervention dropped from 5.4 times per task to 3.3 times.
Another detail: on the most complex tasks, the frequency of Claude proactively asking the user 'Are you sure you want to do this?' is more than twice that of human-initiated interruptions.
This is interesting. It's not humans unilaterally supervising AI; AI is also confirming human intentions in return.
Putting the Two Reports Together, What Do You See?
The two reports measure different things: MIT counts products, Anthropic counts call volumes.
The MIT report looks from the outside, at what the public documentation of the 30 Agent products says (or doesn't say); Anthropic's looks from the inside, at how Agents actually run in real use.
And in Anthropic's usage data, programming accounts for nearly 50%, while other sectors each get only a few percentage points.
The MIT report says developers are not transparent: missing safety documentation, underreported autonomy, broken accountability chains. The implication: we know too little about these Agents.
The Anthropic report says autonomy is growing rapidly in practice: uninterrupted runtime nearly doubling in three months, users handing over approval rights, high-risk scenarios already emerging. The implication: these Agents are gaining real power faster than expected.
Layered together, they point to the same thing: we know less and less about Agents, while they do more and more.
Why is Programming Running Ahead?
Agent products are growing rapidly, but deep usage remains highly concentrated in programming alone.
Semiconductor analyst Doug O'Laughlin of SemiAnalysis calls programming the 'beachhead' (entry point) for AI entering the $15 trillion information work market; Anthropic CEO Dario Amodei summarized more succinctly at Davos: 'Software engineering is the clearest test scenario—structured, digital, measurable.'
OpenAI co-founder Andrej Karpathy (who has since left) pointed out an even deeper logic: programming is the only field where AI's output directly accelerates AI's own progress. AI-written code makes the next generation of AI stronger, a self-accelerating flywheel no other industry has.
Put together: programming is the path of least resistance for AI adoption, and the only self-accelerating field. Those two traits combined let it far outpace every other industry.
Programming has broken through, but what comes after?
We clarified why programming leads, but there's another question worth considering: In this already validated scenario, what exactly is the relationship between humans and Agents?
The trust data from Anthropic actually already gave clues.
The behavioral differences between new and veteran users show that trust building is not simply 'the more you use, the more you let go', but more like evolving from a rough 'either full trust or no trust' mode to a fine-grained 'let it run big tasks while precisely monitoring' mode.
Currently, 73% of Agent calls still maintain human involvement; at first glance that looks like 'incomplete automation,' but from another perspective: at this stage, human-AI collaboration might itself be the correct answer, not an intermediate state toward 'full automation'.
If that's the case, industries like healthcare and law, where room for error is smaller, may need a higher human involvement ratio than 73%, and more dense approval nodes. The programming scenario validates the human-AI collaboration framework itself, but when moving this framework to other scenarios, parameters need to be recalibrated according to industry characteristics.
Is Change Already Happening?
Yes, though still early.
Anthropic's Economic Index shows that education-related tasks on Claude grew from 9% in January 2025 to 15% in December 2025, the fastest-growing non-programming category. Among enterprise API customers, office and administrative support tasks also rose by 3 percentage points to 13%.
The industry side also has specific cases.
Thomson Reuters' CoCounsel, backed by the company's 170 years of curated editorial content and a knowledge base from 4,500 subject-matter experts, lets lawyers finish case-law research that used to take hours in just a few minutes. In cybersecurity, eSentire cut threat analysis from 5 hours to 7 minutes, with accuracy at 95%, on par with senior experts.
These changes aren't small. But calling it an explosion is still premature.
These two reports depict a snapshot of AI Agents at this moment.
The supply side is already bustling: giants are rubbing their hands over the enterprise-workflow track, and Wall Street already fears the 'SaaSpocalypse.' Demand-side enthusiasm, though, is still stuck in programming alone.
SemiAnalysis calls programming the 'beachhead.' Beachhead means: it's been captured, but the inland hasn't been fought yet.
But a beachhead is just a beachhead. According to Microsoft AI Economy Institute data, as of 2025 only 0.04% of people worldwide have tried AI programming, only 0.3% pay for AI tools, and 84% have never genuinely used AI at all.
Programming is indeed out in front, but it's still only a small vanguard force; the battle for the inland market hasn't really begun.