Let AI Level Itself Up: Meta Pushes Coding to Superintelligence with Self-play RL

Hello everyone, I am PaperAgent, not Agent!

Meta FAIR, Meta TBD Lab, Carnegie Mellon, and collaborators take a first step toward a "Superintelligent Software Engineering Agent".


SSR (Self-play SWE-RL): for the first time, under zero-human-annotation conditions, relying solely on self-generated bugs plus self-repair (the model playing against itself), it stably beats the human-data baseline across the SWE-bench family of benchmarks.

Why is this important?

Figure 1 shows the core loop of SSR:


> Left: Bug-Injection Agent generates "Bug artifacts" containing test patches; Right: Bug-Solving Agent repairs using only the "reverse of the test patch" as the specification; both share the same set of LLM weights and update simultaneously via RL.

| Old Paradigm (SFT/RLHF) | New Paradigm (SSR Self-Play) |
| --- | --- |
| Relies on GitHub Issues/PRs, human-written natural-language descriptions, and test cases. | Requires only a runnable Docker image (source code + dependencies). |
| Learns "how humans fix bugs". | Learns "how to create and solve harder bugs itself". |
| Data has a ceiling. | Data multiplies indefinitely as training proceeds. |

Method Essence

3.1 Minimal Assumption — A "Bare Repository" Is Enough

  • Input: A Docker image with dependencies installed.
  • Does not require ready-made test commands, Issue descriptions, test parsers, or even programming-language labels.
  • All test discovery/parsing/running commands are explored on the spot by the Injection Agent.

3.2 What does a Bug Artifact look like?

| Filename | Function |
| --- | --- |
| bug_inject.diff | Implants a bug in the business code. |
| test_weaken.diff | Deletes or weakens assertions that would expose the bug, creating a "test blind spot". |
| test_script.sh | Executable script that runs the tests and outputs text logs. |
| test_parser.py | Converts the text logs to JSON ({test_id: pass/fail}) for RL reward calculation. |
| test_files.txt | Records which test files participate in verification, preventing the agent from "cheating" by modifying tests. |
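To make the artifact contract concrete, here is a toy sketch of what a `test_parser.py`-style script might do. The log format (`<test_id> PASSED/FAILED` lines) is an assumption of mine for illustration; in SSR the real parser is generated per repository by the Injection Agent.

```python
import json
import re

# Assumed log format: lines like "tests/test_io.py::test_read PASSED".
LINE_RE = re.compile(r"^(?P<test_id>\S+::\S+)\s+(?P<status>PASSED|FAILED)$")

def parse_log(text: str) -> dict:
    """Map each test id to "pass"/"fail" for RL reward calculation."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            status = "pass" if m.group("status") == "PASSED" else "fail"
            results[m.group("test_id")] = status
    return results

log = "tests/test_io.py::test_read PASSED\ntests/test_io.py::test_write FAILED"
print(json.dumps(parse_log(log)))
```

The JSON it emits is exactly the shape the reward computation needs: a flat `{test_id: pass/fail}` map, with any non-matching log noise silently dropped.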

Figure 2 shows a test_weaken.diff and its reverse patch — this is the only "specification" the Solver side receives:


3.3 Self-Play Training Process

  1. Injection Role
    • Strategy: wholesale code deletion, or inverse-reverting commits from the git history.
    • Reward: r_inject = −1.0 on consistency failure; −α if the bug is unsolvable (s = 0) or trivial (s = 1); 1 − (1 + α)·s at ideal difficulty (0 < s < 1).
    • Goal: pull the solve-rate s toward ≈ 0.2 (the theoretical optimum derived in the paper's §B).
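The piecewise injection reward can be written directly as code. A minimal sketch, with the function name and the default α = 1 chosen by me for illustration:

```python
def r_inject(solve_rate: float, consistent: bool = True, alpha: float = 1.0) -> float:
    """Piecewise injection reward over the solver's empirical solve-rate s.

    -1.0              : consistency failure (artifact does not reproduce the bug)
    -alpha            : bug unsolvable (s == 0) or trivial (s == 1)
    1 - (1 + alpha)*s : ideal difficulty (0 < s < 1), paying most for
                        hard-but-solvable bugs
    """
    if not consistent:
        return -1.0
    if solve_rate in (0.0, 1.0):
        return -alpha
    return 1.0 - (1.0 + alpha) * solve_rate
```

With α = 1, a bug solved in 20% of rollouts earns r_inject(0.2) = 0.6, while a trivial or unsolvable bug earns −1: the challenger is paid for difficulty, but only solvable difficulty.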
  2. Solver Role
    • Sees only the reversed test patch plus the broken repository.
    • Reward: binary, +1 if all tests pass, otherwise −1.
    • Failed attempts are fed directly into the next round as harder bugs, building up a growing pool of hard cases.
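The solver side is simpler. A sketch of the binary reward plus the recycled hard-case pool described above (all names here are illustrative, not the paper's API):

```python
def r_solve(test_results: dict) -> float:
    """Binary solver reward: +1 only if every verified test passes, else -1."""
    ok = bool(test_results) and all(v == "pass" for v in test_results.values())
    return 1.0 if ok else -1.0

hard_bug_pool = []  # failed attempts become the next round's curriculum

def record_attempt(bug_artifact, test_results):
    """Score one repair attempt; recycle unsolved bugs as harder training data."""
    reward = r_solve(test_results)
    if reward < 0:
        hard_bug_pool.append(bug_artifact)
    return reward
```

Note the asymmetry: the solver's reward is all-or-nothing, while the challenger's (above) is graded by difficulty; the hard-case pool is what turns each failure into future training signal.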
  3. Parameter Sharing + Simultaneous RL Update
    Gradients from both roles are backpropagated together, so the model learns to "create difficulty" and "resolve difficulty" at the same time.
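The shared-weights update can be sketched schematically. Everything below is a toy illustration under a REINFORCE-style assumption: the lists stand in for score-function terms (grad log π · R), and real training would use autograd over the LLM rather than hand-written vectors.

```python
def reinforce_grad(episodes):
    """Sum reward-weighted score-function terms over episodes
    (toy stand-in for grad log pi(a|s) * R)."""
    total = [0.0, 0.0, 0.0]
    for reward, grad_log_prob in episodes:
        total = [t + reward * g for t, g in zip(total, grad_log_prob)]
    return total

theta = [0.0, 0.0, 0.0]  # one weight vector shared by both roles

# Toy (reward, grad-log-prob) samples from each role within a single step.
inject_eps = [(0.6, [0.1, -0.2, 0.0])]
solve_eps = [(1.0, [0.0, 0.3, -0.1]), (-1.0, [0.2, 0.0, 0.1])]

lr = 0.01
joint = [a + b for a, b in zip(reinforce_grad(inject_eps), reinforce_grad(solve_eps))]
theta = [t + lr * g for t, g in zip(theta, joint)]  # one simultaneous update
```

The design point is that there is only one `theta`: injection episodes and solving episodes contribute to the same gradient accumulator before a single optimizer step, rather than training two separate models.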

Experimental Results

4.1 Main Results

Figure 8 shows that over the entire 150-step training trajectory, SSR steadily self-improves and outperforms the human-data baseline throughout:

| Benchmark | CWM-sft Start | Human Data RL | SSR (Self-Play) | Δ |
| --- | --- | --- | --- | --- |
| SWE-bench Verified | 22.1% | 25.3% | 35.7% | +10.4 |
| SWE-Bench Pro | 21.1% | 24.4% | 32.2% | +7.8 |

> Note: Human Data RL = Same image + natural language Issue + manual test script; SSR has none of these.

4.2 Ablation Experiments

Figure 9 shows three ablations:

  1. Injection-only and Repair-only training both lose points → the two roles must be trained together.
  2. Bug-injection strategy:
    • Direct random edits → collapse into trivial one-line value changes.
    • Code deletion combined with git-history inverse reverts → most realistic, highest score.
  3. Solver feedback contributes almost nothing to the injection reward itself; what truly matters is the distribution produced by online co-evolution, not single-point noise signals.

Theoretical Insight: The "Optimal Solution" for the Challenger

The paper uses game theory in Appendix B to prove:

  • As long as the challenger's action space is large enough (e.g., it can modify tests), it can construct a dominant strategy of pseudo-random failures that pins the solver at the target solve-rate p* ≈ 0.2 while the solver learns no real repair capability.
  • Mitigation: Anchor the challenger on real, diverse codebases and limit its strategy space so it does not deviate from "natural bugs" — which is exactly what SSR does.
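The interior-optimum intuition can be checked numerically. A small simulation under assumptions of my own choosing (α = 1, solve-rate estimated from K = 8 Bernoulli rollouts; the paper's exact p* depends on its own constants): with the −α endpoint penalty, the challenger's expected reward peaks at an interior true solve-rate rather than at s → 0.

```python
from math import comb

ALPHA, K = 1.0, 8  # illustrative constants: endpoint penalty, rollouts per bug

def expected_reward(p: float) -> float:
    """Challenger's expected reward when the solver's true solve-rate is p
    and the empirical rate s = m/K is estimated from K rollouts."""
    total = 0.0
    for m in range(K + 1):
        prob = comb(K, m) * p**m * (1 - p) ** (K - m)
        s = m / K
        r = -ALPHA if m in (0, K) else 1 - (1 + ALPHA) * s
        total += prob * r
    return total

grid = [i / 100 for i in range(1, 100)]
best_p = max(grid, key=expected_reward)
print(f"expected reward peaks at p = {best_p:.2f}")
```

Near p = 0 the all-fail penalty dominates and the expected reward is strongly negative, so a challenger that only produces unsolvable bugs is punished; the maximum sits at a moderate interior solve-rate, which is the behavior the reward design is meant to induce.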

In Conclusion

Self-play SWE-RL moves "fighting oneself" from Go and chess into real software repositories for the first time. With zero annotations, it lets the model build its own curriculum, solve its own challenges, and keep pushing its learning curve upward. True "superintelligence" is still some distance away, but this provides a feasible path:

> Let the Agent "level itself up by fighting monsters" in endless real code, instead of memorizing human debug notes.

Paper: Toward Training Superintelligent Software Agents through Self-Play SWE-RL (https://arxiv.org/pdf/2512.18552)
Code: https://github.com/facebookresearch/cwm

Recommended Reading

    Hands-on Design of AI Agents: (Orchestration, Memory, Plugins, Workflow, Collaboration)

    Although LLMs Are Strong, Frankly Speaking: For OCR, Small Open-Source Models Are Better

    2026, New Trend: World Model × Embodied Intelligence Latest Review

    A Systematic Review of the Latest Self-Evolving AI Agents Paradigm

    Every day, an LLM paper to exercise our thinking~ If you have read this far, why not give a 👍, ❤️, ↗️ triple click, and add a star ⭐, so you don't get lost~


    AINews · AI News Aggregation Platform
    © 2026 AINews. All rights reserved.