Imagine a real workday: a project manager needs to update project statuses, a finance officer must organize client invoices, and a medical administrator has to verify appointments and insurance information.
These aren't highly specialized expert tasks. Often, a diligent intern could complete them by simply following a procedure.
But for today's AI agents, these "everyday tasks" are far more complex than they seem. They require understanding business objectives, searching for information across applications, keeping state consistent, and correctly entering every detail into the system over dozens or even hundreds of steps.
This is the reality that SaaS-Bench aims to reveal: an agent doesn't just need to know how to click buttons and fill out forms; it must be capable of completing long, multi-step workflows in a real office environment.
If an agent cannot reliably complete tasks that an intern handles daily, we need to reassess how far we truly are from a genuinely usable agent.
Blog Post: https://unipat.ai/blog/SaaS-Bench
GitHub Repository: https://github.com/UniPat-AI/SaaS-Bench
Paper Link: https://arxiv.org/abs/2605.15777
The "singularity" of the Computer-Use Agent hasn't arrived; instead, the hype has been doused with a bucket of cold water.
Over the past year, various GUI agents have rushed to claim they can replace human workers. Benchmark scores have soared, investors got excited, and the media celebrated, making a "fully automated office" seem just around the corner.
But UniPat AI has just shown, with a new set of data, that this was all built on sand!
Leaderboard
23 Real Systems, 106 Tasks: A Brutal, Real-World Examination
To put it bluntly, existing agent evaluations typically involve simulated environments, simple tasks, and a few dozen steps at most. This is a far cry from real work.
What does a real office look like? A medical administrator writes a SOAP note, fills out a case report, and generates an official document. A finance officer receives a reimbursement request, approves it, processes the payment, and records it in the ledger. This involves jumping between multiple systems, and the steps can easily number in the hundreds.
SaaS-Bench takes a radical approach: it directly deploys real systems into Docker containers, forcing agents to work within genuine front-end and back-end logic, database states, and business constraints.
SaaS-Bench Tasks – Real-World Workflow Scenarios
SaaS-Bench has carefully selected 23 open-source SaaS (Software-as-a-Service) systems, all deployed locally via Docker, preserving complete front-end and back-end logic, database states, and business rules. The systems cover six professional domains:
Software Development: OpenProject, Baserow, Code-Server, Metabase
Business & Finance: Twenty CRM, BigCapital, HRMS, Pretix
Healthcare Management: OpenEMR, OpnForm, OnlyOffice
Team Collaboration: SiYuan, Roundcube, Mattermost, ownCloud
Agricultural Supply Chain: FarmOS, Grocy, Recipya, E-Label
Independent Media: PhotoPrism, MediaCMS, BookLore, Watcharr
More importantly, these systems aren't empty shell interfaces. Each piece of software is populated with realistic business data, including records for users, projects, orders, and files. The agent doesn't enter a blank test page but a real working environment with historical data, distractors, and cross-system relationships.
A three-layer distribution of task modalities, domains, and applications.
Of the 106 tasks, 93.4% span at least two applications, with half (53 tasks) involving three applications. There are 74 text-only tasks and 32 involving multimodal understanding. Based on execution traces from Claude Opus 4.6, 97.3% of text tasks require over 100 steps, with the longest traces exceeding 300 steps.
Task Difficulty Analysis – Most tasks are Cross-App and Long-Horizon.
How were these tasks created, and how is an agent's operational capability evaluated?
SaaS-Bench uses an "LLM generation + expert review" approach to task construction:
Initially, a large language model (LLM) generates tasks across the six professional domains and specific professional roles, clarifying task objectives, cross-application dependencies, and verification requirements. Multiple rounds of revision reduce ambiguity and loopholes.
Subsequently, experts manually screen and fact-check the tasks, focusing on whether they are professional, natural, executable, and verifiable. Tasks with illogical sequences, flawed logic, or inaccurate verification are modified or removed, ensuring every task can be genuinely executed and accurately scored by the verifier.
Task Construction Flowchart – Four stages to ensure task quality.
SaaS-Bench allows agents to operate computers within the SaaS environment using Browser-Use and provides two metrics:
Resolved Score (strict pass rate): A score of 1 is given only if all checkpoints are passed; otherwise, it's 0.
Checkpoint Score (lenient score): The weighted fraction of checkpoints passed, so partially completed tasks still earn credit.
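To make the difference between the two metrics concrete, here is a minimal sketch of how they could be computed for a single task. The `Checkpoint` class, the checkpoint names, and the weights are hypothetical illustrations, not SaaS-Bench's actual verifier code.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str      # hypothetical label for a verifiable sub-goal
    weight: float  # relative importance of this checkpoint
    passed: bool   # verdict from the verifier


def checkpoint_score(checkpoints):
    """Lenient score: weighted fraction of checkpoints that passed."""
    total = sum(c.weight for c in checkpoints)
    if total == 0:
        return 0.0
    return sum(c.weight for c in checkpoints if c.passed) / total


def resolved_score(checkpoints):
    """Strict score: 1 only if every checkpoint passed, otherwise 0."""
    return 1 if checkpoints and all(c.passed for c in checkpoints) else 0


# Illustrative task: two of three checkpoints pass.
example = [
    Checkpoint("create client", 0.25, True),
    Checkpoint("issue invoice", 0.25, True),
    Checkpoint("record payment", 0.5, False),
]

print(checkpoint_score(example))  # 0.5
print(resolved_score(example))    # 0
```

In this toy case the agent earns half of the weighted credit yet scores zero on the strict metric, which is the kind of gap the two numbers are designed to expose.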
An overview of the process: Agent → Browser-Use → Execution → Verification → Scoring.
As the results below show, the gap between these two numbers exposes the core problem with current agents.
The Leaderboard is Out: A Clean Sweep of Failures
Take a look at these numbers:
Main Results (DeepSeek V4, M2.7, and GLM5.1 are single-modality models, evaluated only on the Text-Only Domain).
The strongest performer, Claude Opus 4.7, achieved a checkpoint score of 43.9%, but its end-to-end resolved score was only 3.8%, meaning it fully completed just 4 of the 106 tasks. Kimi K2.5 and Gemini 3.1 Pro? Their resolved scores were zero. They couldn't complete a single task from start to finish.
The implication of these figures is brutal: agents can push through some intermediate parts of a job, but they have virtually no ability to complete an entire long-horizon workflow.
Can running it multiple times save the day?
Pass@k Results for the four models.
Each model was run independently 3 times on the same task, with a pass counted if any one run succeeded. The pass@3 score saw an overall improvement of about 8 percentage points compared to pass@1.
For Sonnet 4.6 on multimodal tasks, the pass rate jumped from 33.9% to 52.1% (+18.2pp)—it's not that the model is completely incapable, but rather that its execution is highly unstable.
This is not due to environmental randomness. The initial state for each run is completely identical. It's path dependency: a tiny difference at a decision point causes the subsequent trajectory to fork completely.
Running it multiple times helps, but it's far from a solution.
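The "any one of k runs succeeds" rule described above can be sketched as follows. This is a literal reading of that description using pass/fail outcomes; the function name and the sample data are hypothetical, and how the benchmark aggregates partial scores across runs is not specified here.

```python
def pass_at_k(runs_by_task, k):
    """Fraction of tasks where at least one of the first k runs succeeded."""
    if not runs_by_task:
        return 0.0
    solved = sum(1 for runs in runs_by_task.values() if any(runs[:k]))
    return solved / len(runs_by_task)


# Made-up outcomes for four tasks, three independent runs each.
runs = {
    "task_a": [False, True, False],
    "task_b": [False, False, False],
    "task_c": [True, True, False],
    "task_d": [False, False, True],
}

print(pass_at_k(runs, 1))  # 0.25
print(pass_at_k(runs, 3))  # 0.75
```

In this toy data, two of the three tasks solved by the third attempt failed on the first. That mirrors the instability described above: the capability is there, but no single run can be relied on to show it.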
The More Complex, the Lower the Score
All three structural dimensions show a monotonic decrease in performance:
Scores vs. Number of Apps / Scores vs. Step Count / Scores vs. Number of Checkpoints.
Number of Applications (1→4): The average score drops from 53% to 20%.
Number of Operation Steps: The longer the task trajectory, the lower the score, and markedly so.
Number of Checkpoints (≤6 vs ≥18): The average score drops from 65% to 27%.
Tasks that are "cross-application + long-trajectory + fine-grained verification" score the lowest. This is precisely the most common form of real-world workflows.
Four Structural Failure Modes: Where Agents Stumble
SaaS-Bench's true value isn't in the scores themselves, but in exposing four fatal flaws of agents in a real-world environment.
Failure 1: The longer the task, the worse the performance.
Even if the pass rate for an individual checkpoint is as high as 95%, the probability of passing all 12 checkpoints is only about 54% (0.95^12 ≈ 0.54, assuming the checkpoints are independent). And the average number of checkpoints in SaaS-Bench far exceeds 12.
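The arithmetic behind that 54% figure is simple compounding. The sketch below assumes every checkpoint passes independently with the same probability, which is a simplification rather than a claim about how real trajectories behave.

```python
def all_pass_probability(per_checkpoint_rate, n_checkpoints):
    """Probability that every checkpoint passes, assuming independence."""
    return per_checkpoint_rate ** n_checkpoints


# How quickly a 95% per-checkpoint rate erodes as tasks get longer.
for n in (6, 12, 18):
    print(n, round(all_pass_probability(0.95, n), 2))
```

At 12 checkpoints the result is roughly 0.54, and it keeps falling as more checkpoints are added, so a high per-step success rate is not enough for long tasks.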
All models exhibit the same pattern: the pass rate shows a downward trend as the task progresses. No model can sustain its early-stage performance in the later stages.
Models get fewer and fewer steps right as task execution progresses.
This is an irreversible downward curve. The further a task goes, the less likely it is to be completed.
Failure 2: One wrong step leads to cascading errors.
A typical case: a task required creating a company client, "Arcturus Digital". The agent filled in both a contact name and a company name, triggering the individual client logic, and actually created an individual client, "Elena Vasquez".
From then on, 10 invoices, payment records, and account reconciliation were all linked to the wrong entity. The weight of this core checkpoint was only 3%, but it caused a downstream loss of 30% of the weighted score.
An illustration of how an upstream task error causes a downstream failure chain.
A single 3% error node caused a 30% loss of the total score.
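One way to picture this cascade is as a dependency graph over checkpoints, where failing an upstream node also fails everything that depends on it. The sketch below uses hypothetical checkpoint names and weights chosen so that a 3-point root error removes 30 points in total; the article does not break down the real task's weights.

```python
def score_with_cascade(weights, depends_on, directly_failed):
    """Weighted score when each failure also fails all downstream checkpoints."""
    failed = set(directly_failed)
    changed = True
    while changed:  # propagate failures until nothing new is added
        changed = False
        for checkpoint, parents in depends_on.items():
            if checkpoint not in failed and any(p in failed for p in parents):
                failed.add(checkpoint)
                changed = True
    total = sum(weights.values())
    kept = sum(w for cp, w in weights.items() if cp not in failed)
    return kept / total, failed


# Hypothetical weights in percentage points (they sum to 100).
weights = {
    "client_type": 3,
    "invoices": 15,
    "payments": 8,
    "reconciliation": 4,
    "unrelated_work": 70,
}
# Each checkpoint lists the checkpoints it relies on.
depends_on = {
    "client_type": [],
    "invoices": ["client_type"],
    "payments": ["invoices"],
    "reconciliation": ["payments"],
    "unrelated_work": [],
}

score, failed = score_with_cascade(weights, depends_on, {"client_type"})
print(score)           # 0.7
print(sorted(failed))  # the root error plus everything downstream of it
```

The point of the sketch is that a checkpoint's weight understates its importance whenever other checkpoints are built on top of it.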
Failure 3: Finishing without verification, assuming it's correct.
Claude Opus 4.6 correctly identified a date error at Step 124 (2026-03-19 vs. 2026-03-20) and executed a modification. However, it did not return to the page to verify the change, and directly proceeded to the next sub-task. At Step 210, when submitting, its report stated, "Invoice date 2026-03-20, fixed" — but the actual date on the page was still 03-19.
The agent believed it succeeded on an intent level, but the verifier discovered the failure on a state level. This disconnect between the two is systemic. Current Computer-Use Agent (CUA) frameworks lack a "rigorous reflection loop": the agent is like a student who never checks their own homework.
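A read-back check after each write is one simple form such a loop could take. The sketch below is only an illustration: `FakeInvoicePage` is a hypothetical stand-in for a form whose save can silently fail, not part of any real CUA framework or of SaaS-Bench.

```python
class FakeInvoicePage:
    """Hypothetical stand-in for a SaaS form whose save may silently fail."""

    def __init__(self, date, save_works):
        self._stored = date
        self._save_works = save_works

    def set_date(self, new_date):
        # A broken save leaves the stored value untouched without raising.
        if self._save_works:
            self._stored = new_date

    def read_date(self):
        return self._stored


def update_with_readback(page, new_date, max_attempts=2):
    """Write a value, then re-read the page state before reporting success."""
    for _ in range(max_attempts):
        page.set_date(new_date)
        if page.read_date() == new_date:
            return True   # state-level confirmation
    return False          # the intent was issued, but the state never changed


ok_page = FakeInvoicePage("2026-03-19", save_works=True)
bad_page = FakeInvoicePage("2026-03-19", save_works=False)

print(update_with_readback(ok_page, "2026-03-20"))   # True
print(update_with_readback(bad_page, "2026-03-20"))  # False
```

With a check like this, the agent's final report would be based on what the page actually contains rather than on what the agent intended to write.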
Failure 4: Inconsistent scores on the same test.
In three independent runs of the same task, Claude Sonnet 4.6's score ranged from 0.00 to 0.68. This is not due to environmental randomness—the initial state for each run was identical—but to path dependency. A tiny difference at a decision point causes the entire execution trajectory to diverge completely, turning an agent's performance on long-horizon tasks into a gamble.
Three runs of Claude Sonnet 4.6 on the same task.
What This Means
SaaS-Bench has shattered an illusion: there's a massive chasm between an agent's benchmark scores and its real-world work capability.
The four structural failure modes—performance degrading over time, cascading errors from a single mistake, failing to verify after a task, and inconsistent scores on every run—all point to a single underlying fact: current agents lack the ability to effectively reason about persistent states, lack a closed-loop verification mechanism after operations, and lack the ability to recover from errors.
These aren't problems that can be solved by simply making models larger or adding a few engineering modules. They point to a deeper limitation in the current agent paradigm: in long-horizon tasks, the model lacks a continuous perception of the global state and cannot "keep a mental count" like a human. This isn't just a technical debt; it's the ceiling of the current paradigm.
Is a Computer-Use Agent truly ready to take over human jobs? The road ahead is still very long. SaaS-Bench has laid the map out on the table—now it's up to everyone to see how to navigate it.
But this also leads to a growing consensus: today's SaaS is designed for humans. Menus, buttons, and forms all serve human eyes and fingers. But when an agent becomes the primary user, these interfaces become a hindrance. The future isn't about teaching agents to operate human software, but rather redesigning the software itself for agents. What SaaS-Bench reveals isn't just the shortcomings of agents, but also the shelf life of current software forms. Consumer-facing SaaS may need to be rebuilt entirely for agents.
UniPat AI
UniPat AI is committed to building a new paradigm for AI training, evaluation, and application oriented towards real-world scenarios, driving the large-scale implementation of agent capabilities across various industries to create tangible economic and social value.
Official Website: https://unipat.ai