Claude 4.6 Only Scores 66%? Claw-Eval-Live Says: Fixing a Terminal ≠ Cross-System Capability

Today's Agents don't just answer questions. They call APIs, query databases, modify workspaces, and trigger services. Precisely because of this, evaluation can't just check if the final sentence looks like the right answer; it must verify whether the agent actually did the work, did it safely, and left the environment state correct.

Claw-Eval-Live is the live extension of the Claw-Eval series. Claw-Eval first establishes whether an Agent has truly completed a task; Claw-Eval-Live asks a deeper question: do the tasks the benchmark measures still represent today's actual workflows?

The core of Claw-Eval is to turn the execution process into auditable evidence. Every evaluation runs in an isolated environment, and scoring is based not on the final output but on execution trajectories, server-side audit logs, and the post-execution environment snapshot. A controlled experiment in the paper showed that when an LLM judge was given only the conversation log and the grading script, without audit logs or environment snapshots, it missed 44% of safety violations and 13% of robustness issues. In other words, judging by results alone systematically overestimates the Agent.

But being accurate isn't enough. Agents face workflows, and workflows evolve: today, the most common task is cross-system reconciliation; tomorrow, it might be HR onboarding, ticket dispatching, calendar coordination, or supplier payment verification. Static benchmarks can be highly reproducible, but the task mix may have already drifted from real-world needs.

Claw-Eval-Live is designed to solve this drift. It's not about changing questions randomly every day, but making each release a timestamped snapshot of reality. The signals layer observes public workflow demand signals; the release layer freezes task definitions, execution environments, data fixtures, and grading scripts, ensuring results remain reproducible and comparable.

The ClawHub signals here are neither ground-truth demand nor an automatic question generator; they serve as a public, verifiable demand prior. The pipeline runs through signal collection, pattern clustering, family weighting, and trial screening of candidate tasks, and finally uses MILP (mixed-integer linear programming) to select public tasks from the candidates while constraining release scale, family coverage, and leaderboard differentiation.
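To make the selection step concrete, here is a minimal sketch, not the paper's actual formulation. It swaps the MILP solver for brute-force enumeration over a tiny candidate pool, and everything in it is an assumption made for illustration: the candidate tasks, the family weights, the pilot pass rates, and the use of p * (1 - p) as a stand-in for leaderboard differentiation.

```python
from itertools import combinations

# Hypothetical candidates: (task_id, family, pilot_pass_rate). Invented data.
CANDIDATES = [
    ("t1", "HR", 0.1),
    ("t2", "HR", 0.5),
    ("t3", "WORKFLOW", 0.4),
    ("t4", "WORKFLOW", 0.95),
    ("t5", "TERMINAL", 0.9),
    ("t6", "TERMINAL", 0.6),
]

# Hypothetical family weights, standing in for the demand prior.
FAMILY_WEIGHT = {"HR": 1.0, "WORKFLOW": 1.5, "TERMINAL": 0.5}


def utility(task):
    """Family weight times p * (1 - p): tasks that nearly every model
    passes or fails separate models poorly, so they score low."""
    _, family, p = task
    return FAMILY_WEIGHT[family] * p * (1.0 - p)


def select(candidates, size, min_per_family=1):
    """Exhaustively pick `size` tasks with maximum total utility, requiring
    at least `min_per_family` tasks from each family. Returns None if no
    subset satisfies the constraints."""
    families = {fam for _, fam, _ in candidates}
    best, best_score = None, float("-inf")
    for subset in combinations(candidates, size):
        counts = {fam: 0 for fam in families}
        for _, fam, _ in subset:
            counts[fam] += 1
        if any(c < min_per_family for c in counts.values()):
            continue  # violates the family-coverage constraint
        score = sum(utility(t) for t in subset)
        if score > best_score:
            best, best_score = subset, score
    return best


chosen = select(CANDIDATES, size=4)
chosen_ids = sorted(task_id for task_id, _, _ in chosen)
```

Enumeration only works for toy pools like this one; at the scale of a real release, the same objective and constraints would be handed to a MILP solver with one binary variable per candidate task.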

The current public release contains 105 tasks, 17 task families, and 13 frontier models. Each task is not just a prompt, but a complete executable unit: task.yaml, tool interfaces, data fixtures, and grader.py are all indispensable.

The scoring also tries to avoid awarding points for something that merely "looks plausible." Claw-Eval-Live prioritizes checking deterministic evidence: were the correct tools invoked? Do entities and numerical values match the ground truth? Did the necessary state changes actually occur? Only for semantic dimensions like report organization and summary quality does it introduce structured LLM judging.
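The deterministic part of such a check might look roughly like the sketch below. It is not the project's grader.py: the `Evidence` structure, the `grade` function, and the equal weighting of the three checks are all hypothetical, chosen only to show tool, value, and state checks being evaluated from recorded evidence rather than from the agent's final message.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical bundle of what was recorded during one run."""
    tool_calls: list        # tool names taken from the execution trajectory
    reported_values: dict   # numeric values the agent reported
    final_state: dict       # post-execution environment snapshot


def grade(evidence, required_tools, expected_values, expected_state, tol=1e-6):
    """Run three deterministic checks and return (per-check results, score).

    The score is the fraction of checks passed; equal weighting is an
    assumption made for this sketch.
    """
    checks = {
        "tools": all(t in evidence.tool_calls for t in required_tools),
        "values": all(
            k in evidence.reported_values
            and abs(evidence.reported_values[k] - v) <= tol
            for k, v in expected_values.items()
        ),
        "state": all(
            evidence.final_state.get(k) == v for k, v in expected_state.items()
        ),
    }
    return checks, sum(checks.values()) / len(checks)


# An agent that reports the right total but never performs the write.
run = Evidence(
    tool_calls=["query_invoices", "query_payments"],
    reported_values={"unpaid_total": 1250.0},
    final_state={"invoice_42": "open"},
)
checks, score = grade(
    run,
    required_tools=["query_invoices", "query_payments"],
    expected_values={"unpaid_total": 1250.0},
    expected_state={"invoice_42": "flagged"},
)
```

In this invented run the answer looks right, yet the state check fails because the record was never updated, which is the kind of gap a final-output-only judge would not see.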

The experimental results are sobering: no model's pass rate exceeded 70%, and the gap between the top and bottom of the leaderboard is 22.9 percentage points. Even more notably, some models have similar pass rates but different Overall Completion scores, indicating they often don't completely fail; they just miss one tool call, lack one piece of evidence, or fail to fully clean up their state.
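A small numeric sketch shows how equal pass rates can hide different completion levels. It assumes, hypothetically, that a task counts as passed when its score reaches a fixed threshold and that Overall Completion is the mean per-task score; neither definition is confirmed by the source, and the scores are invented.

```python
PASS_THRESHOLD = 0.8  # assumed cut-off, not taken from the paper


def pass_rate(scores, threshold=PASS_THRESHOLD):
    """Fraction of tasks whose score reaches the pass threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def overall_completion(scores):
    """Mean per-task score, so partial progress on failed tasks counts."""
    return sum(scores) / len(scores)


# Invented per-task scores for two hypothetical models.
model_a = [1.0, 0.9, 0.7, 0.6]  # failures are near misses
model_b = [1.0, 0.9, 0.2, 0.1]  # failures are near-total

same_pass_rate = pass_rate(model_a) == pass_rate(model_b)
gap = overall_completion(model_a) - overall_completion(model_b)
```

Both models pass two of four tasks, but model_a's failed tasks are close to done while model_b's barely started, so the completion metric separates them where the pass rate cannot.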

The most counter-intuitive finding is that the terminal isn't the hardest part. The Development/Terminal category is approaching a ceiling for strong models; what truly trips them up are HR/People, Management/Ops, and cross-system workflows. The average pass rate for HR is only 6.8%, and for WORKFLOW, it's just 12.8%. This shows that the current shortcoming of Agents is not "can they use a terminal," but whether they can continuously gather evidence across multiple systems, correctly correlate records, and complete the necessary write operations.

Claw-Eval proved that Agent evaluation can't just look at outcomes. Claw-Eval-Live further demonstrates that benchmarks can't stay frozen in a static question bank for long. Together, they split the problem in two: first, confirm that the Agent really did the work; then, confirm that we are testing the workflows that matter most right now.

Paper: https://arxiv.org/abs/2604.28139

Leaderboard: https://claw-eval-live.github.io

Code: https://github.com/Claw-Eval-Live/Claw-Eval-Live
