First Evaluation Released: AI Code Modifications May Mostly "Make Things Worse"! Can Programmers Stop Worrying About Their Jobs?


In recent years, the programming capabilities of large AI models have advanced by leaps and bounds. Major AI vendors are chasing each other in programming benchmark tests, constantly breaking records. This has led many programmers to worry: Will AI soon take away our jobs?

However, a new study jointly released by Sun Yat-sen University and Alibaba has given programmers some reassurance.

On March 4, the two institutions jointly released results from a benchmark named "SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration." For the first time, it systematically evaluated the long-term code-maintenance capabilities of 18 mainstream large AI models from eight major vendors, including Anthropic, OpenAI, Kimi, and DeepSeek.

The test included 100 tasks, consuming over 10 billion tokens in total. The results showed that the Claude Opus series led in overall performance.

When it came to controlling performance regression, most large AI models, including Qwen, DeepSeek, MiniMax, Kimi, and Doubao, performed poorly. In other words, during long-term code maintenance, AI may make the code "worse and worse."


Chinese Team Launches World's First Evaluation System for Assessing Long-Term Code Maintenance Capabilities of Large AI Models

For a long time, mainstream benchmarks for AI programming capabilities have shared one characteristic: snapshot-style evaluation, whose core is "receive requirements once, output a solution once."

However, this evaluation method only tests whether the large model can write functionally correct code and cannot reflect the core needs of continuous iteration and long-term maintenance in real software development.

In reality, mature software is rarely built overnight; it is the product of long-term maintenance. Lehman's laws of software evolution indicate that software quality naturally degrades as maintenance proceeds. Moreover, maintenance accounts for 60% to 80% of the total cost of the software lifecycle.

To evaluate AI performance in long-term code maintenance, the team from Sun Yat-sen University and Alibaba jointly launched the SWE-CI evaluation benchmark. This is the world's first evaluation system specifically designed to assess the long-term code maintenance performance of AI agents. It is no longer satisfied with examining the "one-time correctness" of AI programming; instead, it evaluates whether AI can continuously maintain code quality like a real software engineer during development processes lasting months or even years.

The construction of the SWE-CI benchmark test underwent four layers of strict screening to finally form a high-quality evaluation set.

The research team first screened 4,923 code repositories from GitHub's global Python code libraries that had been maintained for over three years, had more than 500 stars, included dependency files and complete unit test suites, and adopted permissive licenses such as MIT/Apache-2.0. Then, they extracted commit pairs with stable dependencies and code changes exceeding 1,000 lines, obtaining 8,311 candidate samples. Through automatic Docker environment construction and self-healing dependency mechanisms, 1,458 runnable candidate pairs were retained. Finally, after test startup verification, pass rate difference screening, and sorting by time span and commit volume, 100 final tasks were determined.
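The first screening layer described above can be sketched as a simple predicate. This is an illustration only, assuming the criteria stated in the article; the field names (`maintained_years`, `stars`, etc.) are hypothetical and do not reflect the team's actual tooling.

```python
def passes_initial_screen(repo: dict) -> bool:
    """First-layer screen per the article: over three years of maintenance,
    more than 500 stars, dependency files, a complete unit-test suite, and
    a permissive license. Field names are illustrative assumptions."""
    return (
        repo["maintained_years"] >= 3
        and repo["stars"] > 500
        and repo["has_dependency_file"]
        and repo["has_unit_tests"]
        and repo["license"] in {"MIT", "Apache-2.0"}
    )
```

The later layers (commit-pair extraction, Docker environment construction, test verification) each prune the candidate pool further, from 8,311 candidate pairs down to the final 100 tasks.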

Each of the 100 tasks carefully constructed by the research team corresponds to the complete evolution of a real-world software project. These projects span an average development period of 233 days and include an average of 71 consecutive commits. The team also designed an ingenious "Architect-Programmer" dual-agent collaboration mechanism, inspired by the common division of labor in real software teams: the architect analyzes requirements and formulates the technical solution, while the programmer handles the concrete code development.

To adapt to long-term iterative evaluation, SWE-CI proposed two core metrics: "Normalized Change" and "EvoScore (Evolution Score)."

"Normalized Change" is based on the number of passed test cases, mapping the code state to the interval [-1, 1], where positive values indicate functional improvement and negative values indicate functional regression.

EvoScore focuses more on measuring the performance of large AI models in future modification tasks.


Actual Test Results: Claude Opus Leads by a Wide Margin; Most Large Models Break Existing Code in 75% of Tasks

The research team systematically tested 18 mainstream large AI models from eight companies—Moonshot AI, Anthropic, Zhipu AI, Qwen, MiniMax, DeepSeek, OpenAI, and Doubao—consuming over 10 billion tokens in total. This experimental scale is unprecedented in the field of AI programming evaluation.

The research results show that from a temporal dimension, the evolution of large AI models in code maintenance capabilities presents a distinct acceleration curve.

As the chart below shows, new model versions from the same vendor consistently score higher than the previous generation, and the jumps after 2026 are markedly larger, with higher EvoScores. This indicates that the coding capabilities of current large models are rapidly evolving from static defect repair toward continuous, long-term code maintenance.


Changes in EvoScore of mainstream large models from 8 vendors in the SWE-CI test. Image source: Paper screenshot

Among all evaluated large models, the Claude Opus series performed most outstandingly. From Claude-opus-4.5 to Claude-opus-4.6, its EvoScore jumped to a high position of about 0.9, clearly widening the gap with all competitors.

Among Chinese large AI models, the Zhipu GLM series has made significant progress, becoming the most competitive player in the second tier. Closely following are Qwen and MiniMax, with an overall positive trend. While Kimi and Doubao have improved, they lack a breakthrough.

The study also found that different vendors have obvious differentiation in their preferences for large model training strategies.

Specifically, MiniMax, DeepSeek, and OpenAI's GPT series large models prefer long-term benefits, showing their advantages in long-term code maintenance tasks. This means that when generating code, these large models tend to adopt strategies conducive to long-term evolution and stability rather than pursuing the optimal solution for short-term repairs.

In contrast, Kimi and the Zhipu GLM series are more inclined towards optimization paths that yield results in the short term.

Meanwhile, Qwen, Doubao, and the Claude series large models exhibit another characteristic: their training strategies achieve a certain balance between short-term effects and long-term maintenance.


As the weight parameter γ changes, the model rankings shift significantly. When γ > 1, a higher-ranked model has stronger codebase-maintenance capability. Image source: Paper screenshot

Additionally, the study made another key finding: In long-term code maintenance, all large models perform poorly in effectively controlling performance regression.

Performance regression is a core indicator for measuring the stability of software quality. If a unit test passes before a code update but fails after the update, the change is deemed to have triggered performance regression. Once performance regression occurs, it not only directly affects user experience but, during long-term maintenance, can also lead to systematic degradation of system quality as modifications accumulate.

The research team measured the "Zero Regression Rate"—the proportion of tasks that did not break any original functionality throughout the entire maintenance process. The higher the zero regression rate, the more stable the maintained system.
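The two definitions above (a regression is a test that passed before a change but fails after it; the zero regression rate is the fraction of tasks with no such breakage across the whole history) can be sketched as follows. This is a minimal illustration under those stated definitions, not the team's actual evaluation code; the data layout is an assumption.

```python
def has_regression(results_before: dict, results_after: dict) -> bool:
    """A change triggers a regression if any test that passed before the
    update fails after it. Results map test name -> pass/fail bool."""
    return any(
        passed and not results_after.get(test, False)
        for test, passed in results_before.items()
    )

def zero_regression_rate(tasks: list) -> float:
    """Fraction of tasks whose entire maintenance history broke nothing.

    Each task is a list of (results_before, results_after) pairs, one
    per commit in the maintenance sequence.
    """
    clean = sum(
        1 for commits in tasks
        if not any(has_regression(before, after) for before, after in commits)
    )
    return clean / len(tasks)
```

For example, with one task whose commits never break a passing test and one task with a single breaking commit, the zero regression rate is 0.5.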

The research results indicate that among all 18 large models participating in the test, only Anthropic's Claude Opus large model maintained a zero regression rate of over 50%, while the zero regression rate of most large models was below 25%.


Zero regression rates of 18 large models (sorted from low to high). Image source: Paper screenshot

Specifically, Claude-opus-4.6 led far ahead with a zero regression rate of 76%. This means that in the vast majority of test scenarios, its performance remained stable. Claude-opus-4.5 ranked second with 51%. In comparison, Kimi-K2.5 (37%) and GLM-5 (36%) performed similarly, forming the second tier; although they possess certain stability, there is still a significant gap compared to the top-tier large models.

The zero regression rates of the remaining 14 large models, including GPT-5.2, Qwen3.5-plus, MiniMax-M2.5, and DeepSeek-V3.2, were all below 25%. This means that during long-term code maintenance, these models broke previously working functionality, triggering performance regressions, in more than 75% of tasks.

However, from the perspective of version iteration, AI large models from leading vendors are progressing rapidly. For example, the "zero regression rate" of the Claude-opus series increased from 51% in version 4.5 to 76% in version 4.6, and the Zhipu GLM series jumped from 14% in GLM-4.6 and GLM-4.7 to 36% in GLM-5.

Even so, the vast majority of large models still find it difficult to eliminate performance regression issues in long-term code maintenance, and there is still a significant gap before reliable automated long-term development can be achieved.

The release of the SWE-CI benchmark results has made the industry realize that "writing code" and "maintaining code" are two entirely different capabilities. For large model vendors, continuously improving maintainability, regression control, and architectural design capabilities may be the key to winning the next phase of the competition.

(Disclaimer: The content and data in this article are for reference only and do not constitute investment advice. Please verify before use. Operate at your own risk.)

Reporter | Lan Suying, Chang Songzishen (Intern)

Editor | He Xiaotao, Wang Jiaqi, Du Hengfeng

Proofreader | Duan Lian

Cover image source: NBD Media Asset Library


| NBD News - Original Article |

Reproduction, abstraction, copying, and mirroring without permission are prohibited.

If reproduction is needed, please apply to the official account backend and obtain authorization.
