Hello everyone, I am PaperAgent, not Agent!
Meta FAIR, the Meta TBD Lab, Carnegie Mellon, and collaborators have taken a first step toward a "superintelligent software engineering agent".
SSR (Self-play SWE-RL): for the first time, under zero-human-annotation conditions, relying solely on self-generated bugs plus self-repair (the model playing against itself), it stably beats the human-data baseline across the SWE-bench family of benchmarks.
Why is this important?
Figure 1 shows the core loop of SSR:
> Left: the Bug-Injection Agent generates "bug artifacts" containing test patches; right: the Bug-Solving Agent repairs using only the reverse of the test patch as its specification. Both roles share one set of LLM weights and are updated simultaneously via RL.
| Old Paradigm (SFT/RLHF) | New Paradigm (SSR Self-Play) |
|---|---|
| Relies on GitHub Issue/PR, human-written natural language descriptions, and test cases. | Requires only a runnable Docker image (source code + dependencies). |
| Learns "how humans fix bugs". | Learns "how to create and solve harder bugs itself". |
| Data has a ceiling. | Data multiplies infinitely with training. |
Method Essence
3.1 Minimal Assumption — "Naked Repository" is Sufficient
- Input: A Docker image with dependencies installed.
- Does not require ready-made test commands, Issue descriptions, test parsers, or even language type labels.
- All test discovery/parsing/running commands are explored on the spot by the Injection Agent.
3.2 What does a Bug Artifact look like?
| Filename | Function |
|---|---|
| `bug_inject.diff` | Implants a bug in the business code. |
| `test_weaken.diff` | Deletes or weakens assertions that would expose the bug, creating a "test blind spot". |
| `test_script.sh` | Executable script that runs the tests and emits a text log. |
| `test_parser.py` | Converts the text log to JSON `{test_id: pass/fail}` for RL reward computation. |
| `test_files.txt` | Records which test files participate in verification, preventing the Agent from "cheating" by modifying tests. |
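A minimal sketch of what a generated `test_parser.py` might look like, assuming a pytest-style text log. This is an illustration only: the paper's parsers are generated per repository, so the `LINE_RE` format here is an assumption.

```python
import json
import re

# Hypothetical parser in the spirit of test_parser.py: turn a pytest-style
# text log into the {test_id: "pass"/"fail"} JSON the RL reward needs.
# The log line format below is assumed, not taken from the paper.
LINE_RE = re.compile(r"^(?P<test_id>\S+::\S+)\s+(?P<status>PASSED|FAILED)")

def parse_log(log_text: str) -> str:
    """Extract per-test pass/fail results from a text log as a JSON string."""
    results = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            results[m.group("test_id")] = (
                "pass" if m.group("status") == "PASSED" else "fail"
            )
    return json.dumps(results)
```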
Figure 2 shows a `test_weaken.diff` and its reverse patch, which is the only "specification" the Solver side receives.
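Reversing a unified diff amounts to swapping the roles of `+` and `-`. A minimal sketch of the idea (my own illustration, not the paper's tooling; binary patches, renames, and other edge cases are ignored):

```python
import re

# Swap the before/after ranges in a unified-diff hunk header: @@ -a,b +c,d @@
HUNK_RE = re.compile(r"^@@ -(\S+) \+(\S+) @@(.*)$")

def reverse_diff(diff_text: str) -> str:
    """Return the reverse of a plain unified diff (illustrative sketch)."""
    lines = diff_text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("--- ") and i + 1 < len(lines) and lines[i + 1].startswith("+++ "):
            # Swap the old/new file headers as a pair so ordering stays valid.
            out.append("--- " + lines[i + 1][4:])
            out.append("+++ " + line[4:])
            i += 2
            continue
        m = HUNK_RE.match(line)
        if m:
            out.append(f"@@ -{m.group(2)} +{m.group(1)} @@{m.group(3)}")
        elif line.startswith("+"):
            out.append("-" + line[1:])  # additions become removals
        elif line.startswith("-"):
            out.append("+" + line[1:])  # removals become additions
        else:
            out.append(line)  # context lines are unchanged
        i += 1
    return "\n".join(out)
```

Applying the function twice recovers the original patch, which is a quick sanity check that the reversal is well-formed.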
3.3 Self-Play Training Process
- Injection Role
- Strategy: whole-chunk code deletion, or inverse reverts of commits mined from git history.
- Reward:
$$
r_{\text{inject}} =
\begin{cases}
-1.0 & \text{consistency check fails} \\
-\alpha & \text{bug unsolvable } (s = 0) \text{ or trivial } (s = 1) \\
1 - (1+\alpha)\,s & \text{ideal difficulty } (0 < s < 1)
\end{cases}
$$
- Goal: pull the solve-rate s toward ≈0.2 (the theoretical optimum derived in the paper's Appendix B).
- Solver Role
- Only sees "reverse test patch" + crashed repository.
- Reward: Binary, +1 if all tests pass, otherwise -1.
- Failed attempts are fed directly into the next round as harder bugs, building up a "wrong-answer set".
- Parameter sharing + simultaneous RL updates: gradients from both roles are backpropagated together, so the model learns to "create difficulty" and "solve difficulty" at the same time.
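The two reward signals above can be sketched in a few lines. The value `ALPHA = 4.0` is an assumption chosen so the injection reward crosses zero exactly at s = 0.2, the target solve-rate; the paper's actual setting may differ.

```python
ALPHA = 4.0  # assumed: with alpha = 4 the injection reward is zero at s = 0.2

def injection_reward(consistent: bool, solve_rate: float, alpha: float = ALPHA) -> float:
    """Reward for the Bug-Injection Agent given the Solver's solve-rate s."""
    if not consistent:            # artifact failed its own consistency checks
        return -1.0
    if solve_rate in (0.0, 1.0):  # bug is unsolvable or trivially easy
        return -alpha
    return 1.0 - (1.0 + alpha) * solve_rate  # positive only for hard-but-solvable bugs

def solver_reward(all_tests_pass: bool) -> float:
    """Binary reward for the Bug-Solving Agent."""
    return 1.0 if all_tests_pass else -1.0
```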
Experimental Results
4.1 Main Results
Figure 8 shows that over the entire 150-step training trajectory, SSR stably self-improves and consistently outperforms the human-data baseline:
| Benchmark | CWM-sft Start | Human Data RL | SSR (Self-Play) | Δ |
|---|---|---|---|---|
| SWE-bench Verified | 22.1 % | 25.3 % | 35.7 % | +10.4 |
| SWE-Bench Pro | 21.1 % | 24.4 % | 32.2 % | +7.8 |
> Note: Human Data RL = Same image + natural language Issue + manual test script; SSR has none of these.
4.2 Ablation Experiments
Figure 9 shows three ablations:
- Training Injection-only or Repair-only both lose accuracy → the two roles must be trained together.
- Bug injection strategy:
  - Direct random edits → collapse into trivial one-line value changes.
  - Deletion combined with git-history inverse reverts → the most realistic bugs and the highest score.
- Solver feedback adds negligible signal to the Injection reward; what really matters is the distribution shaped by online co-evolution, not pointwise noisy signals.
Theoretical Insight: The "Optimal Solution" for the Challenger
The paper uses game theory in Appendix B to prove:
- If the challenger's action space is too large (e.g., it can modify tests), it can construct a dominant strategy of pseudo-random failures that pins the solver at the target solve-rate p* ≈ 0.2 without the solver ever truly learning to repair.
- Mitigation: anchor the challenger to real, diverse codebases and limit its strategy space so it cannot drift away from "natural bugs", which is exactly what SSR does.
In Conclusion
Self-play SWE-RL moves "playing against oneself" from Go and Chess into real software repositories for the first time. With zero annotations, the model creates its own curriculum, solves its own challenges, and keeps pushing its learning curve upward. There is still a long way to true "superintelligence", but this offers a feasible path:
> Let the Agent "level itself up by fighting monsters" in endless real code, instead of memorizing human debug notes.
Paper: Toward Training Superintelligent Software Agents through Self-Play SWE-RL (https://arxiv.org/pdf/2512.18552)
Code: https://github.com/facebookresearch/cwm
Recommended Reading
Hands-on Design of AI Agents: Orchestration, Memory, Plugins, Workflow, Collaboration
LLMs are great, but frankly: for OCR, open-source small models are better
2026, New Trend: World Model × Embodied Intelligence Latest Review
A Systematic Review of the Latest Self-Evolving AI Agents Paradigm
Every day, an LLM paper to exercise our thinking~