Let AI Level Itself Up: Meta Pushes Coding to Superintelligence with Self-play RL

Hello everyone, I am PaperAgent, not Agent!

Meta FAIR, Meta TBD Lab, Carnegie Mellon, and collaborators take a first step toward a "Superintelligent Software Engineering Agent".


SSR (Self-play SWE-RL): for the first time, under zero-human-annotation conditions, relying solely on self-generated bugs plus self-repair (the model playing against itself), it stably beats the human-data baseline across the SWE-bench family of benchmarks.

Why is this important?

Figure 1 shows the core loop of SSR:


> Left: Bug-Injection Agent generates "Bug artifacts" containing test patches; Right: Bug-Solving Agent repairs using only the "reverse of the test patch" as the specification; both share the same set of LLM weights and update simultaneously via RL.

| Old Paradigm (SFT/RLHF) | New Paradigm (SSR Self-Play) |
| --- | --- |
| Relies on GitHub Issues/PRs, human-written natural-language descriptions, and test cases. | Requires only a runnable Docker image (source code + dependencies). |
| Learns "how humans fix bugs". | Learns "how to create and solve harder bugs itself". |
| Data has a ceiling. | Data multiplies indefinitely as training proceeds. |

Method Essence

3.1 Minimal Assumption — A "Bare Repository" Is Enough

  • Input: A Docker image with dependencies installed.
  • Does not require ready-made test commands, Issue descriptions, test parsers, or even programming-language labels.
  • All test discovery/parsing/running commands are explored on the spot by the Injection Agent.

3.2 What does a Bug Artifact look like?

| Filename | Function |
| --- | --- |
| bug_inject.diff | Implants a bug in the business code. |
| test_weaken.diff | Deletes or weakens assertions that would expose the bug, creating a "test blind spot". |
| test_script.sh | Executable script that runs the tests and outputs text logs. |
| test_parser.py | Converts the text logs to JSON ({test_id: pass/fail}) for RL reward calculation. |
| test_files.txt | Records which test files participate in verification, preventing the agent from "cheating" by modifying tests. |
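To make the artifact contract concrete, here is a toy sketch of what a `test_parser.py`-style script might do. The log format (`<test_id> PASSED/FAILED` lines) is an assumption of mine for illustration; in SSR the real parser is generated per repository by the Injection Agent.

```python
import json
import re

# Assumed log format: lines like "tests/test_io.py::test_read PASSED".
LINE_RE = re.compile(r"^(?P<test_id>\S+::\S+)\s+(?P<status>PASSED|FAILED)$")

def parse_log(text: str) -> dict:
    """Map each test id to "pass"/"fail" for RL reward calculation."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            status = "pass" if m.group("status") == "PASSED" else "fail"
            results[m.group("test_id")] = status
    return results

log = "tests/test_io.py::test_read PASSED\ntests/test_io.py::test_write FAILED"
print(json.dumps(parse_log(log)))
```

The JSON it emits is exactly the shape the reward computation needs: a flat `{test_id: pass/fail}` map, with any non-matching log noise silently dropped.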

Figure 2 shows a test_weaken.diff and its reverse patch — this is the only "specification" the Solver side receives:


3.3 Self-Play Training Process

  1. Injection Role
    • Strategy: wholesale code deletion, or inverse-reverting commits from the git history.
    • Reward: r_inject = −1.0 on consistency failure; −α if the bug is unsolvable (s = 0) or trivial (s = 1); 1 − (1 + α)·s at ideal difficulty (0 < s < 1).
    • Goal: pull the solve-rate s toward ≈ 0.2 (the theoretical optimum derived in the paper's §B).
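The piecewise injection reward can be written directly as code. A minimal sketch, with the function name and the default α = 1 chosen by me for illustration:

```python
def r_inject(solve_rate: float, consistent: bool = True, alpha: float = 1.0) -> float:
    """Piecewise injection reward over the solver's empirical solve-rate s.

    -1.0              : consistency failure (artifact does not reproduce the bug)
    -alpha            : bug unsolvable (s == 0) or trivial (s == 1)
    1 - (1 + alpha)*s : ideal difficulty (0 < s < 1), paying most for
                        hard-but-solvable bugs
    """
    if not consistent:
        return -1.0
    if solve_rate in (0.0, 1.0):
        return -alpha
    return 1.0 - (1.0 + alpha) * solve_rate
```

With α = 1, a bug solved in 20% of rollouts earns r_inject(0.2) = 0.6, while a trivial or unsolvable bug earns −1: the challenger is paid for difficulty, but only solvable difficulty.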
  2. Solver Role
    • Sees only the reversed test patch plus the broken repository.
    • Reward: binary, +1 if all tests pass, otherwise −1.
    • Failed attempts are fed directly into the next round as harder bugs, building up a growing pool of hard cases.
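The solver side is simpler. A sketch of the binary reward plus the recycled hard-case pool described above (all names here are illustrative, not the paper's API):

```python
def r_solve(test_results: dict) -> float:
    """Binary solver reward: +1 only if every verified test passes, else -1."""
    ok = bool(test_results) and all(v == "pass" for v in test_results.values())
    return 1.0 if ok else -1.0

hard_bug_pool = []  # failed attempts become the next round's curriculum

def record_attempt(bug_artifact, test_results):
    """Score one repair attempt; recycle unsolved bugs as harder training data."""
    reward = r_solve(test_results)
    if reward < 0:
        hard_bug_pool.append(bug_artifact)
    return reward
```

Note the asymmetry: the solver's reward is all-or-nothing, while the challenger's (above) is graded by difficulty; the hard-case pool is what turns each failure into future training signal.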
  3. Parameter Sharing + Simultaneous RL Update
    Gradients from both roles are backpropagated together, so the model learns to "create difficulty" and "resolve difficulty" at the same time.
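The shared-weights update can be sketched schematically. Everything below is a toy illustration under a REINFORCE-style assumption: the lists stand in for score-function terms (grad log π · R), and real training would use autograd over the LLM rather than hand-written vectors.

```python
def reinforce_grad(episodes):
    """Sum reward-weighted score-function terms over episodes
    (toy stand-in for grad log pi(a|s) * R)."""
    total = [0.0, 0.0, 0.0]
    for reward, grad_log_prob in episodes:
        total = [t + reward * g for t, g in zip(total, grad_log_prob)]
    return total

theta = [0.0, 0.0, 0.0]  # one weight vector shared by both roles

# Toy (reward, grad-log-prob) samples from each role within a single step.
inject_eps = [(0.6, [0.1, -0.2, 0.0])]
solve_eps = [(1.0, [0.0, 0.3, -0.1]), (-1.0, [0.2, 0.0, 0.1])]

lr = 0.01
joint = [a + b for a, b in zip(reinforce_grad(inject_eps), reinforce_grad(solve_eps))]
theta = [t + lr * g for t, g in zip(theta, joint)]  # one simultaneous update
```

The design point is that there is only one `theta`: injection episodes and solving episodes contribute to the same gradient accumulator before a single optimizer step, rather than training two separate models.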

Experimental Results

4.1 Main Results

Figure 8 shows that over the entire 150-step training trajectory, SSR steadily self-improves and outperforms the human-data baseline throughout:

| Benchmark | CWM-sft Start | Human Data RL | SSR (Self-Play) | Δ |
| --- | --- | --- | --- | --- |
| SWE-bench Verified | 22.1% | 25.3% | 35.7% | +10.4 |
| SWE-Bench Pro | 21.1% | 24.4% | 32.2% | +7.8 |

> Note: Human Data RL = Same image + natural language Issue + manual test script; SSR has none of these.

4.2 Ablation Experiments

Figure 9 shows three ablations:

  1. Injection-only and Repair-only training both lose points → the two roles must be trained together.
  2. Bug-injection strategy:
    • Direct random edits → collapse into trivial one-line value changes.
    • Code deletion combined with git-history inverse reverts → most realistic, highest score.
  3. Solver feedback contributes almost nothing to the injection reward itself; what truly matters is the distribution produced by online co-evolution, not single-point noise signals.

Theoretical Insight: The "Optimal Solution" for the Challenger

The paper uses game theory in Appendix B to prove:

  • As long as the challenger's action space is large enough (e.g., it can modify tests), it can construct a dominant strategy of pseudo-random failures that pins the solver at the target solve-rate p* ≈ 0.2 while the solver learns no real repair capability.
  • Mitigation: Anchor the challenger on real, diverse codebases and limit its strategy space so it does not deviate from "natural bugs" — which is exactly what SSR does.
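The interior-optimum intuition can be checked numerically. A small simulation under assumptions of my own choosing (α = 1, solve-rate estimated from K = 8 Bernoulli rollouts; the paper's exact p* depends on its own constants): with the −α endpoint penalty, the challenger's expected reward peaks at an interior true solve-rate rather than at s → 0.

```python
from math import comb

ALPHA, K = 1.0, 8  # illustrative constants: endpoint penalty, rollouts per bug

def expected_reward(p: float) -> float:
    """Challenger's expected reward when the solver's true solve-rate is p
    and the empirical rate s = m/K is estimated from K rollouts."""
    total = 0.0
    for m in range(K + 1):
        prob = comb(K, m) * p**m * (1 - p) ** (K - m)
        s = m / K
        r = -ALPHA if m in (0, K) else 1 - (1 + ALPHA) * s
        total += prob * r
    return total

grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=expected_reward)
print(f"expected reward peaks at p = {best_p:.2f}")
```

Near p = 0 the all-fail penalty dominates and the expected reward is strongly negative, so a challenger that only produces unsolvable bugs is punished; the maximum sits at a moderate interior solve-rate, which is the behavior the reward design is meant to induce.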

In Conclusion

Self-play SWE-RL moves "fighting oneself" from Go and chess into real software repositories for the first time. With zero annotations, it lets the model build its own curriculum, solve its own challenges, and keep pushing its learning curve upward. True "superintelligence" is still some distance away, but this provides a feasible path:

> Let the Agent "level itself up by fighting monsters" in endless real code, instead of memorizing human debug notes.

Paper: Toward Training Superintelligent Software Agents through Self-Play SWE-RL (https://arxiv.org/pdf/2512.18552)
Code: https://github.com/facebookresearch/cwm

Recommended Reading

    Hands-on Design of AI Agents: (Orchestration, Memory, Plugins, Workflow, Collaboration)

    Although LLMs Are Strong, Frankly Speaking: For OCR, Small Open-Source Models Are Better

    2026, New Trend: World Model × Embodied Intelligence Latest Review

    A Systematic Review of the Latest Self-Evolving AI Agents Paradigm

    Every day, an LLM paper to exercise our thinking~ If you have read this far, why not give a 👍, ❤️, ↗️ triple click, and add a star ⭐, so you don't get lost~


    AINews · AI News Aggregation Platform
    © 2026 AINews. All rights reserved.