Bye-bye SWE-Bench! Cursor Just Released an AI Coding Evaluation Benchmark that Made Claude Cry

Yishui, reporting from Afeisi | QbitAI official account

In the era of programming agents, top-tier Cursor has planted its flag with a new evaluation benchmark:

CursorBench, designed specifically to evaluate which models inside Cursor are more "agent-like", that is, which can execute complex tasks efficiently.

Guess what? Claude Haiku 4.5 and Claude Sonnet 4.5, once famous for their SWE-Bench scores, have both crashed.

Claude Haiku 4.5's score dropped from 73.3 (SWE-Bench) to 29.4 (CursorBench);

Claude Sonnet 4.5's dropped from 77.2 to 37.9.

This also perfectly reflects the difference between CursorBench and other programming benchmarks:

SWE-Bench measures whether a model can solve a problem; CursorBench measures whether it can solve one efficiently. That is precisely the gap ordinary benchmarks cannot cover: completing tasks under realistic token constraints.

In the agent era, everyone in the know understands that evaluating AI now comes down to execution ability, and specifically, efficient execution.

The appearance of CursorBench precisely fills this gap.

But the question is, how exactly does CursorBench evaluate?

Online + Offline Mixed Evaluation

Regarding how it evaluates, Cursor even wrote a dedicated blog post.

Right off the bat, Cursor lays out the basic background:

As AI programming assistants become more and more "agent-like," many current public benchmarks are no longer sufficient.

There are mainly three problems:

First, the task types are not realistic.

Take the benchmark everyone knows best: SWE-Bench mainly fixes bugs from GitHub issues, so its tasks are relatively narrow.

Terminal-Bench, meanwhile, is no longer limited to code repositories, but it leans toward "puzzle-like tasks," such as completing a series of challenges in a given environment. There the AI looks more like a contestant in a competition than a developer doing daily work.

So Cursor stated: "We found that these tasks do not match the programming work that developers ask agents to complete."

In real life, it is more common for developers to ask AI to modify multiple files, analyze production logs, or run experiments; in short, real work is messier than the benchmarks.

Second, the scoring mechanism is unreasonable.

Many public benchmarks usually assume that there is only one correct answer to a problem.

But in reality, a requirement may have multiple valid implementations, and different schemes may differ in code style and architecture choices.

This often leads to one of two failure modes: either a correct scheme gets marked wrong (a false negative), or ambiguity is forcibly eliminated for the sake of gradability (artificial constraints are imposed).

Either way, the benchmark cannot reflect the real situation.
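The false-negative failure mode can be made concrete: an exact-match grader rejects a functionally correct alternative that a behavior-based grader accepts. A toy sketch (both `dedupe` implementations are invented for illustration, not taken from any benchmark):

```python
# Two functionally identical solutions to the same task: remove duplicates
# from a list while preserving order.
reference = "def dedupe(xs): return list(dict.fromkeys(xs))"
submission = (
    "def dedupe(xs):\n"
    "    seen = set()\n"
    "    return [x for x in xs if not (x in seen or seen.add(x))]"
)

def grade_exact(code: str) -> bool:
    # Single-answer grading: anything that isn't the reference text fails.
    return code == reference

def grade_behavior(code: str) -> bool:
    # Behavior-based grading: run the candidate against a test case instead.
    ns: dict = {}
    exec(code, ns)
    return ns["dedupe"]([3, 1, 3, 2, 1]) == [3, 1, 2]

print(grade_exact(submission))     # False: a correct solution is rejected
print(grade_behavior(submission))  # True: the same solution passes the test
```

The exact-match grader produces precisely the false negative the article describes, while the behavioral grader tolerates stylistic and architectural variation.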

Third, the well-known problem of data contamination.

Not much needs to be said here. Once a benchmark has been around long enough, later models are likely to have scraped its data straight into their training sets.

Scoring under such near-"leaked exam" conditions, the value of the results speaks for itself.

Faced with these problems, Cursor has come up with a brand-new solution of "Online + Offline Mixed Evaluation."

The offline part is CursorBench itself, and the process is relatively simple:

Different models complete the same batch of standard tasks, and the system scores them along dimensions such as correctness, code quality, efficiency, and interaction behavior. Each model ends up with an offline benchmark score.
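The article names the scoring dimensions but not how they combine. A minimal sketch, assuming a simple weighted average; the weights and the formula are purely illustrative, not Cursor's published methodology:

```python
# Illustrative weights over the four dimensions named in the article.
# These numbers are assumptions for the sketch, not Cursor's actual values.
WEIGHTS = {
    "correctness": 0.5,
    "code_quality": 0.2,
    "efficiency": 0.2,    # e.g. tokens or time spent per solved task
    "interaction": 0.1,   # e.g. unnecessary clarifying turns, undos
}

def offline_score(dims: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 100]."""
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)

model_a = {"correctness": 80, "code_quality": 70,
           "efficiency": 40, "interaction": 60}
print(offline_score(model_a))  # 68.0
```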

The benefits of this standardized process are obvious: models can be compared from roughly the same starting line, tests can be rerun, and costs stay relatively controllable.

At this point you might object: how is this any different from other benchmarks?

Don't worry, here is CursorBench's "winning weapon": the tasks it selects are different.

Its difference is reflected in three dimensions:

First, the tasks are real.

Previous benchmarks were more like deliberate question-hunting, scraping GitHub issues or assorted puzzles; CursorBench's questions all come from Cursor's own platform.

Cursor has a tool called Cursor Blame, which can trace which AI request generated a given piece of code.

From this, pairs of real data can be extracted: a developer's request plus the code a given model finally shipped.

These pairs form an excellent pool of "question templates" for CursorBench. Moreover, Cursor added:

Many tasks come from our internal codebase and controlled sources, thereby reducing the risk that the model has seen these tasks during the training phase. We update this benchmark every few months to track changes in the way developers use agents.
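The pairing described above can be sketched roughly as follows; the `TaskTemplate` structure and all field names are hypothetical, since Cursor's real pipeline is not public:

```python
from dataclasses import dataclass

@dataclass
class TaskTemplate:
    request: str     # the developer's (often vague) prompt
    final_diff: str  # code attributed to that AI request and later committed

def build_templates(events: list[dict]) -> list[TaskTemplate]:
    # Keep only requests whose generated code actually survived into a
    # commit; abandoned suggestions make poor "ground truth" templates.
    return [
        TaskTemplate(e["request"], e["diff"])
        for e in events
        if e.get("committed")
    ]

events = [
    {"request": "speed up the log parser", "diff": "...", "committed": True},
    {"request": "try a quick hack", "diff": "...", "committed": False},
]
print(len(build_templates(events)))  # 1
```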

Second, the task scale is large.

With so many people now using Cursor, CursorBench's task pool is naturally much larger.

For example, in the correctness evaluation, whether measured by lines of code or by the average number of files touched, problem size has roughly doubled from the initial version to the current CursorBench-3. Cursor stated:

Although lines of code is not a perfect indicator of difficulty, growth in this metric reflects our approach to incorporating more challenging tasks into CursorBench, such as handling monorepo multi-workspace environments, investigating production logs, and running long-running experiments.

Third, task descriptions are deliberately kept "vague."

This point is also easy to understand.

Task descriptions in many public benchmarks are very detailed, but in practice, when developers talk to AI, their requests are often ambiguous.

So being overly precise actually runs contrary to reality.

With the design choices above, CursorBench becomes a benchmark genuinely built around real development scenarios for the programming-agent era.

Of course, that is not the end of it. Written tests alone are not enough: many AIs score high offline, yet users find them disappointing the moment they start using them.

To address this, Cursor also built a set of online evaluations that look directly at how real users fare.

They run A/B tests, giving some users model A and others model B, then comparing the outcomes.

Specifically, they mainly look at trackable product metrics such as whether developers accept the code generated by AI, whether they continue to ask questions, whether they undo modifications, and whether the task is truly completed.
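One way to picture the online comparison, under the assumption that "accepted and not subsequently undone" is the success signal; the article lists the signals but not the formula, so the metric below is an invented simplification:

```python
def acceptance_rate(sessions: list[dict]) -> float:
    # A session counts as a success only if the AI's code was accepted
    # and the developer did not later undo the modification.
    accepted = sum(1 for s in sessions if s["accepted"] and not s["undone"])
    return accepted / len(sessions)

# Made-up data for two A/B arms of 100 sessions each.
arm_a = ([{"accepted": True, "undone": False}] * 70
         + [{"accepted": False, "undone": False}] * 30)
arm_b = ([{"accepted": True, "undone": False}] * 55
         + [{"accepted": True, "undone": True}] * 15
         + [{"accepted": False, "undone": False}] * 30)

print(acceptance_rate(arm_a))  # 0.7
print(acceptance_rate(arm_b))  # 0.55
```

Note that arm B's raw acceptance count is as high as 70, but 15 of those acceptances were undone; counting undos separately is exactly why the article lists it as its own signal.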

In this way, online and offline complement each other and even form a virtuous circle:

Offline CursorBench first quickly screens model capability; online evaluation then verifies whether a model is truly better; and when the two diverge, the benchmark or the model is adjusted.

The flywheel is up and running.

So, What Are the Results?

How do the models actually perform on the new benchmark, CursorBench?

Let's look at the final results (the closer to the upper-right corner, the better: it means achieving the highest performance at the lowest cost):
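The "upper-right corner" reading is essentially a Pareto-frontier criterion: a model is worth considering if no other model is both cheaper and stronger. A minimal sketch with made-up model names and numbers, not Cursor's published results:

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> set[str]:
    """models maps name -> (cost, score); lower cost and higher score win."""
    frontier = set()
    for name, (cost, score) in models.items():
        # A model is dominated if some other model matches or beats it on
        # both axes and is strictly better on at least one.
        dominated = any(
            o_cost <= cost and o_score >= score
            and (o_cost, o_score) != (cost, score)
            for other, (o_cost, o_score) in models.items() if other != name
        )
        if not dominated:
            frontier.add(name)
    return frontier

models = {
    "model_x": (1.0, 40.0),
    "model_y": (3.0, 38.0),  # pricier *and* weaker than model_x: dominated
    "model_z": (5.0, 60.0),
}
print(sorted(pareto_frontier(models)))  # ['model_x', 'model_z']
```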

Seeing this chart, netizens quickly weighed in:

Tsk, I didn't expect Claude Sonnet 4.5's "cost-performance ratio" to be a bit low.

Where did this Composer model (Cursor's self-developed coding model) pop out from?

In any case, one conclusion from Cursor's published results is obvious:

CursorBench discriminates far more sharply among frontier models.

This is natural. Once a benchmark saturates, models can no longer be told apart: everyone scores high and looks equally good.

But once encountering new and difficult ones, the strength gap is naturally revealed.

Especially on benchmarks like CursorBench where the task scale is larger and the environment is more complex, the gap will undoubtedly be further amplified.

Just compare model scores on SWE-Bench and CursorBench and you can see it (scores bunch together on the left, fan out like a staircase on the right):

Cursor also emphasized one point:

The ranking of CursorBench is more consistent with the real user experience.

Through the online experiments mentioned earlier, they found that CursorBench's model ranking basically moves in the same direction as these online indicators.

Next, Cursor will also start developing the next generation evaluation suite:

Although CursorBench-3 tasks last longer than tasks on public benchmarks, they can still be completed within a single session. We expect that within the next year, the vast majority of development work will shift to being completed by long-running agents running independently on their respective computers, so we are also planning to adjust CursorBench accordingly.

Well, the target is still agents, just agents with longer runtimes.

Reference links:

[1] https://x.com/cursor_ai/status/2032148125448610145

[2] https://cursor.com/cn/blog/cursorbench

[3] https://www.objectwire.org/technology/cursor

