Hello everyone, I'm PaperAgent, not just any Agent!
Recently, a large-scale, high-quality Software Engineering (SWE) dataset called Scale-SWE has been officially open-sourced. Leveraging a pioneering "Sandboxed Multi-Agent" workflow, the project successfully mined and constructed over 100,000 real-world SWE tasks from massive GitHub repositories.
The Qwen3-30A3B-Instruct model, fine-tuned using data distilled from this dataset, achieved an impressive 64% score on the SWE-bench-Verified test. This strongly demonstrates that open-source academic models of comparable scale are fully capable of challenging cutting-edge industrial models like GLM-4.7-Flash.
Paper Title: "Immersion in the GitHub Universe: Scaling Coding Agents to Mastery"
Paper Link: https://arxiv.org/abs/2602.09892
Code Repository: https://github.com/AweAI-Team/ScaleSWE
Open Data: https://huggingface.co/collections/AweAI-Team/scale-swe
Scaffold Address: https://github.com/AweAI-Team/AweAgent/tree/main/recipes/scale_swe

Core Advantage: Why Is "Real" SWE Data Crucial?
Currently, in the pursuit of data scaling, the industry often relies on automated processes to generate synthetic data (such as SWE-smith and SWE-Mirror). Although this approach can rapidly produce tens of thousands of data points from a few repositories, analysis indicates that synthetic data often suffers from extremely imbalanced type distributions.
Data comparisons show that, unlike real engineering scenarios, tasks in synthetic datasets such as SWE-smith are overwhelmingly concentrated on simple logic errors. Real datasets like Scale-SWE, by contrast, feature a more comprehensive and balanced distribution of task categories, more accurately reflecting the challenges actually faced in software engineering.
Technical Breakthroughs: Overcoming Three Major Barriers to Real Data Scaling
Historically, constructing real SWE datasets has encountered three major hurdles: extremely complex environment configurations, missing unit tests, and the risk of data leakage in problem statements. To address this, Scale-SWE innovatively introduces a multi-agent collaboration mechanism operating within a sandboxed environment:
1. Environment Builder Agent (EBA)
Traditional environment configuration often relies on static rules (e.g., directly executing pip install -e .), which struggle to cope with the vast diversity of real repositories on GitHub. The EBA can autonomously explore the codebase structure within an isolated sandbox, proactively reading configuration files like README.md or pyproject.toml. After an initial setup, it automatically executes test scripts and iteratively corrects configurations based on real error feedback, ultimately achieving full automation of complex environment setups.
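The setup-probe-correct loop described above can be sketched as follows. Note that `run_cmd` and `propose_fix` are hypothetical stand-ins, not APIs from the project: in the real EBA, the correction step is an LLM call that rewrites the setup command from the raw error output.

```python
def iterative_setup(run_cmd, propose_fix,
                    initial_cmd="pip install -e .", max_rounds=5):
    """Sketch of the EBA feedback loop (assumed interface, not the
    paper's API). run_cmd(cmd) executes a shell command in the sandbox
    and returns (returncode, output); propose_fix(output) stands in for
    the agent's LLM call that derives a corrected setup command from
    real error feedback."""
    cmd = initial_cmd
    for _ in range(max_rounds):
        run_cmd(cmd)                            # attempt installation
        code, output = run_cmd("pytest -x -q")  # probe with the test suite
        if code == 0:
            return cmd                          # environment works
        cmd = propose_fix(output)               # correct from error output
    return None                                 # give up after max_rounds
```

The key design point is that the loop is driven by concrete execution feedback rather than static rules, which is what lets it generalize across heterogeneous GitHub repositories.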
2. Unit-test Creator Agent (UCA)
Many high-quality Pull Requests (PRs) lack unit tests attached by developers, which caused a large number of valuable code records to be discarded in past work. The UCA automatically writes test cases covering both Fail-to-Pass (F2P) and Pass-to-Pass (P2P) scenarios directly from a PR's code changes (diff). By switching between commits to run these tests, the UCA strictly validates the effectiveness of each F2P case, reclaiming data that would otherwise have gone to waste.
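The commit-switching validation can be sketched like this; `run_test_at` is an assumed helper that checks out a given commit in the sandbox and runs a single test, not the project's actual API:

```python
def classify_test(run_test_at, test_id, base_commit, merge_commit):
    """Sketch of the UCA validation step (assumed interface).
    A valid Fail-to-Pass (F2P) test fails before the PR's fix and
    passes after it; a Pass-to-Pass (P2P) test passes in both states.
    run_test_at(commit, test_id) -> bool reports whether the test passes."""
    before = run_test_at(base_commit, test_id)    # pre-fix behaviour
    after = run_test_at(merge_commit, test_id)    # post-fix behaviour
    if not before and after:
        return "F2P"      # reproduces the bug and is fixed by the PR
    if before and after:
        return "P2P"      # regression guard, unaffected by the PR
    return "invalid"      # flaky or unrelated: discard
```

Running each candidate test against both commits is what makes the F2P label trustworthy: a test that passes (or fails) in both states carries no signal about the PR's fix.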
3. Problem Statement Writer Agent (PSWA)
Since some PRs lack associated Issues, directly using large models to generate problem statements based on PRs can easily leak information about the "location of the bug" or the "specific solution." Related ablation studies confirm that the quality of problem statements has a massive impact (nearly 10%) on model performance after Supervised Fine-Tuning (SFT). To ensure completeness while preventing answer leakage, the system calls the Gemini 3 Pro model, known for its strong instruction-following capabilities, supplemented by rigorous prompt design. This ensures the generated content remains semantically consistent with F2P tests without introducing any clues that could enable cheating.
(Note: distillation statistics show that DeepSeek v3.2 consumes more conversation turns and tokens when solving Scale-SWE tasks. This indirectly confirms that the generated problem statements did not leak answers and remained sufficiently challenging.)
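One way to enforce the no-leakage constraint is a post-hoc filter on the generated statements. The sketch below is purely illustrative, an assumption rather than the paper's method (which relies on rigorous prompt design): it rejects a statement that names any file path touched by the PR's diff, since that would reveal the bug's location.

```python
import re

def leaks_solution(problem_statement, diff):
    """Illustrative leakage filter (not the paper's documented method):
    flag a generated problem statement if it mentions any file path
    modified in the PR's unified diff, which would hand the trained
    agent the location of the bug."""
    # '+++ b/<path>' lines in a unified diff name each modified file
    touched = re.findall(r"^\+\+\+ b/(\S+)", diff, flags=re.MULTILINE)
    return any(path in problem_statement for path in touched)
```

A real pipeline would also have to catch subtler leaks (function names, the shape of the fix), which is why the paper leans on prompt design and semantic consistency with the F2P tests rather than string matching alone.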
Evaluation Performance: Dual Verification of Scale and Quality
To verify the practical utility of Scale-SWE, the project team used DeepSeek v3.2 for data distillation, successfully obtaining 71,000 effective trajectories, which were then used for the supervised fine-tuning of Qwen3-30A3B-Instruct.
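The source does not define what makes a trajectory "effective"; a plausible reading (an assumption, not the paper's documented procedure) is that only teacher trajectories whose final patch passes the task's F2P tests are kept, as in this sketch:

```python
def filter_trajectories(trajectories, passes_f2p):
    """Hypothetical distillation filter: keep only teacher trajectories
    whose final patch makes the task's Fail-to-Pass tests pass.
    passes_f2p(patch, task_id) -> bool is an assumed verification hook
    backed by the sandboxed test environments."""
    return [t for t in trajectories
            if passes_f2p(t["final_patch"], t["task_id"])]
```

Filtering by test outcome turns distillation into a form of rejection sampling, so the SFT corpus contains only trajectories that actually solve their tasks.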
Experimental evaluation results highlight the following:
- Significant Baseline Improvement: Compared to base models of similar parameter scale (Qwen3-Coder-30A3B) and industrial large models (GLM-4.7-Flash-30A3B), the model trained on Scale-SWE achieved a significant leap in performance.
- Cross-Level Surpassing: Its test performance even surpassed models like KAT-Dev-32B and those trained on other datasets such as SWE-Lego-32B.
Furthermore, horizontal comparisons show that under an identical distillation process, although the synthetic SWE-smith dataset far exceeds SWE-Gym in quantity, their final effects are nearly identical. In contrast, Scale-SWE, with its massive scale of high-quality real data, delivers a commanding, step-change lead.
The release of Scale-SWE aims to establish a more solid data infrastructure for AI research in software engineering (SWE). By providing ready-to-use, massive real data and distillation trajectories, this open-source project is expected to significantly lower the barrier to entry for research in this field. Researchers and developers are welcome to visit its GitHub repository or Hugging Face page for details and to start using it.