In the months following the release of the M2 series models, we received a wealth of feedback and suggestions from enthusiastic users, prompting us to accelerate our model iteration even further. Beyond simply working harder, the only path we found was to initiate self-evolution for both our models and our organization. MiniMax M2.7 is our first model to participate deeply in its own iterative refinement.
M2.7 can independently construct complex Agent Harnesses and, leveraging capabilities such as Agent Teams, sophisticated Skills, and Tool Search, complete highly complex productivity tasks. For instance, during the development of M2.7, the model built dozens of complex skills within a reinforcement learning Harness based on itself, updated its own memory, drove its own reinforcement learning, and optimized both the reinforcement learning process and the Harness based on the results, thereby kickstarting its self-evolution.
M2.7 demonstrates excellent performance in real-world software engineering, including end-to-end complete project delivery, log analysis for bug debugging, code security, and machine learning. In the SWE-Pro benchmark, M2.7 scored 56.22%, approaching the best levels of Opus. This capability extends to end-to-end complete project delivery scenarios (VIBE-Pro 55.6%) and deep understanding of complex engineering systems in Terminal Bench 2 (57.0%).
In the realm of professional office work, we have enhanced the model's domain-specific knowledge and task delivery capabilities. Its ELO score on GDPval-AA is 1495, the highest among open-source models. M2.7 significantly improves complex editing capabilities for the Office trio—Excel, PPT, and Word—better handling multi-round revisions and high-fidelity editing. With its ability to interact with complex environments, M2.7 maintains a 97% skill adherence rate across 40 complex skill cases (each exceeding 2000 tokens). In the use of OpenClaw, M2.7 shows significant improvement over M2.5, approaching Sonnet 4.6 in the MM-Claw evaluation.
M2.7 possesses excellent identity retention and emotional intelligence. Beyond productivity use cases, this opens room for innovation in interactive entertainment scenarios.
Based on these capabilities, M2.7 is also significantly accelerating our own evolution into an AI-Native organization.
01
Building Self-Evolving Model Agents
To begin, let us share an internal practice where we enabled the M2 series models to self-evolve, serving as an exploration of the boundaries of model Agent capabilities.
Agent Harnesses typically rely on complex Skills, memory systems, and other components to enhance a model's adaptability to different working environments. Building on this, in early versions of M2, we guided it into a research-oriented Agent framework capable of interacting and collaborating with different research project groups. This system covers data pipelines, training environments, evaluation infrastructure, cross-team collaboration, and persistent memory—allowing researchers to drive it to deliver better models. The Research Agent drives the iterative cycle for producing the next generation of models. Researchers guide the direction at every layer, while the model is responsible for construction at every layer.
Taking an RL (Reinforcement Learning) scenario as an example: A researcher starts with an experimental idea and discusses it with the Agent. The Agent assists with literature reviews, continuously tracks preset experimental specifications, completes data pipelines and other integration work, and launches the experiment. During the experiment, it automatically monitors and analyzes the experimental status, automatically triggering log reading, troubleshooting, metric analysis, code fixes, merge requests, and smoke tests to identify and land subtle but critical changes. Work that previously required collaboration among multiple colleagues from different teams can now be handled with researcher intervention only at key decision points and discussions. This greatly accelerates problem discovery and experimental iteration, leading to faster model delivery. In this scenario, M2.7 is capable of handling 30–50% of the workflow.
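The monitoring behavior described above—watch the experiment, and on an anomaly automatically read logs, analyze metrics, and propose a fix before surfacing a decision point to the researcher—can be sketched as follows. This is a minimal illustration, not our internal implementation; all names and the loss threshold are assumptions for the example.

```python
# Hypothetical sketch of one automated monitoring pass over a running
# experiment. On an anomaly, the agent triggers its diagnostic actions
# and escalates to the researcher; otherwise the experiment continues.
from dataclasses import dataclass, field


@dataclass
class Experiment:
    metrics: dict = field(default_factory=dict)
    events: list = field(default_factory=list)  # audit trail of agent actions


def monitor_step(exp: Experiment, loss_threshold: float = 10.0) -> str:
    """One monitoring pass: escalate only on anomalies (threshold is illustrative)."""
    loss = exp.metrics.get("loss", 0.0)
    if loss > loss_threshold:
        # Anomaly: run the automatic diagnostic chain from the text.
        exp.events += ["read_logs", "analyze_metrics", "propose_fix"]
        return "escalate_to_researcher"  # human decision point
    return "continue"


exp = Experiment(metrics={"loss": 42.0})
decision = monitor_step(exp)
```

In the real system the diagnostic actions would open logs, run statistical checks, draft code fixes, and file merge requests; the sketch only shows the control flow that keeps humans at decision points.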
During the iteration process, we also realized that the model's ability to autonomously iterate the harness is crucial. Our internal harness autonomously collects feedback, establishes an internal task evaluation set, and based on this, continuously iterates its own Agent architecture, Skills/MCP implementations, and memory mechanisms to complete tasks better and more efficiently.
For example, we tasked M2.7 with optimizing the software engineering development performance of a model on an internal scaffold. M2.7 ran fully autonomously, executing the iterative cycle of "analyze failure trajectories → plan changes → modify scaffold code → run evaluations → compare results → decide to retain or revert" for over 100 rounds.
Through this process, M2.7 discovered effective optimizations for the model: systematically searching for the optimal combination of sampling parameters such as temperature, frequency penalty, and presence penalty; designing more specific workflow guidelines for the model (such as automatically searching for identical bug patterns in other files after a fix); and adding loop detection optimizations to the scaffold's Agent Loop. Ultimately, this resulted in a 30% performance improvement on our internal evaluation set.
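The "modify → evaluate → retain or revert" cycle above, applied to the sampling-parameter search, can be sketched roughly as follows. The `evaluate` function is a stand-in (in practice each candidate would be scored on the internal evaluation set), and the grid values are illustrative assumptions.

```python
# Hedged sketch of the retain-or-revert search over sampling parameters.
# evaluate() is a placeholder scoring function, NOT the real evaluation set.
import itertools


def evaluate(config: dict) -> float:
    # Stand-in: pretend temperature 0.7 with no frequency penalty is optimal.
    return 1.0 - abs(config["temperature"] - 0.7) - 0.1 * config["frequency_penalty"]


def search_sampling_params() -> dict:
    best_config, best_score = None, float("-inf")
    grid = itertools.product([0.3, 0.7, 1.0], [0.0, 0.5])
    for temperature, frequency_penalty in grid:
        candidate = {"temperature": temperature,
                     "frequency_penalty": frequency_penalty}
        score = evaluate(candidate)
        if score > best_score:
            best_config, best_score = candidate, score  # retain the change
        # else: revert, i.e. keep the previous best configuration
    return best_config


best = search_sampling_params()
```

M2.7 ran this kind of cycle end to end for over 100 rounds, with the additional steps (failure-trajectory analysis, scaffold code changes) that a toy grid search omits.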
We believe that future AI self-evolution will gradually transition to full automation, including completely autonomous coordination of data construction, model training, inference architectures, evaluations, and more. We used M2.7 to participate in 22 machine learning tasks in MLE Bench Lite, covering almost all aspects of R&D.
We designed and implemented a simple scaffold to guide the Agent in autonomous optimization, with core modules including short-term memory, self-feedback, and self-optimization. Specifically, after completing each iteration round, the Agent generates a short-term memory file and performs self-feedback on the current round's results, providing potential optimization directions for the next round. The next round then proceeds with further self-optimization based on the memory and self-feedback chain of all historical rounds.
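The round structure described above can be sketched as a simple loop in which each round appends a memory record and self-feedback, and the next round conditions on the full accumulated chain. This is an illustrative skeleton; the record contents and function names are assumptions, not the actual scaffold code.

```python
# Minimal sketch of the short-term-memory + self-feedback chain: each round
# sees the memory and feedback of ALL prior rounds, then appends its own.
def run_round(round_idx: int, history: list) -> tuple:
    """One evolution round conditioned on the full historical chain."""
    # In the real scaffold, `context` would be fed to the model as its
    # working memory; here we only show that it grows round by round.
    context = [h["memory"] + " | " + h["feedback"] for h in history]
    result = f"round-{round_idx}: attempt informed by {len(context)} prior rounds"
    record = {
        "memory": f"round-{round_idx}: what was tried and observed",
        "feedback": f"round-{round_idx}: optimization directions for next round",
    }
    return result, record


history = []
for i in range(3):
    result, record = run_round(i, history)
    history.append(record)  # the chain grows every round
```

The design choice worth noting is that feedback is generated *by* the agent about its own round, so later rounds inherit both what happened and what the agent believed should change.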
We conducted a total of three tests, each with 24 hours for iterative evolution. As seen in the chart below, M2.7 continuously achieved higher performance over time. The best run achieved 9 gold medals, 5 silver medals, and 1 bronze medal. The average medal rate across the three runs was 66.6%, a result second only to Opus-4.6 (75.7%) and GPT-5.4 (71.2%), and tied with Gemini-3.1 (66.6%).
02
Real-World Software Engineering
In coding and code generation tasks, M2.7 has been further refined to possess the programming capabilities required for real-world software engineering, covering log analysis and bug localization, code refactoring, code security, machine learning, Android development, and more.
Taking online production environment fault debugging—the most common type of online task—as an example: such tasks require not only code generation but also strong comprehensive reasoning abilities. Faced with actual production environment alerts, M2.7 can correlate monitoring metrics with deployment timelines for causal reasoning, perform statistical analysis on trajectory sampling to propose precise hypotheses, proactively connect to databases to verify root causes, locate missing index migration files in code repositories, and even know to use non-blocking index creation to stop the bleeding first before submitting a Merge Request (MR).
From observability analysis and database expertise to SRE-level decision-making—this is not just a model that can write code, but one that truly understands production systems. Compared to traditional manual troubleshooting processes, with M2.7 we have repeatedly shortened the recovery time for online production system failures to under three minutes.
Online production environment fault debugging
On individual programming tasks, M2.7 has reached the level of top-tier international models. In SWE-Pro, which covers multiple programming languages, M2.7 achieved an accuracy of 56.22%, matching GPT-5.3-Codex; it showed even more significant advantages in SWE Multilingual (76.5%) and Multi SWE Bench (52.7%), which are closer to real engineering scenarios.
This capability also extends to end-to-end complete project delivery scenarios. On the repo-level code generation benchmark VIBE-Pro, M2.7 scored 55.6%, nearly matching Opus 4.6—meaning that whether the requirement is Web, Android, iOS, or Simulation related, it can be directly handed over to M2.7 for completion.
Even more noteworthy is its deep understanding of complex engineering systems. In Terminal Bench 2 (57.0%) and NL2Repo (39.8%), which demand high levels of system cognition, M2.7 also performed robustly, further confirming that it is not only good at code generation but can also deeply understand the operational logic and collaborative processes of software systems.
WildGuard demo webpage generated based on M2.7
A crucial feature for improving development efficiency is native Agent Teams (multi-agent collaboration). Agent Teams impose paradigm-level requirements on models: role boundaries, adversarial reasoning, protocol adherence, and behavioral differentiation. These cannot be achieved through prompting alone; they must be internalized as native model capabilities.
In Agent Teams scenarios, the model needs to stably anchor its role identity, proactively challenge teammates' logical and ethical blind spots, and make autonomous decisions within complex state machines. Below is an internal Agent Team we use for product prototype development, containing a minimal organization for creating product prototypes.
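A minimal team definition along the lines described above might encode role identity, behavioral boundaries, and the obligation to challenge teammates as explicit per-role fields rather than leaving them to prompting. The role names and fields below are hypothetical illustrations, not our internal product-prototype team.

```python
# Hypothetical sketch of a minimal Agent Team: each role carries its
# identity anchor, a behavioral boundary, and an adversarial obligation.
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    system_prompt: str      # anchors the role identity
    can_edit_code: bool     # behavioral boundary / role differentiation
    must_challenge: bool    # obligation to probe teammates' blind spots


TEAM = [
    Role("planner",  "Break the product idea into concrete tasks.", False, False),
    Role("builder",  "Implement the planned tasks in code.",        True,  False),
    Role("reviewer", "Challenge logic, ethics, and edge cases.",    False, True),
]


def route(task_kind: str) -> Role:
    """Protocol adherence: only the builder touches code; reviews go to the reviewer."""
    if task_kind == "code_change":
        return next(r for r in TEAM if r.can_edit_code)
    return next(r for r in TEAM if r.must_challenge)
```

The point of making these constraints structural is that the model must then *stay* in role across a long interaction—which is exactly the capability that cannot be bolted on with a prompt.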
03
Professional Office Work
Beyond software engineering, Agents are becoming increasingly useful in office scenarios. We believe this relies on two core capabilities:
Domain expertise and task delivery capability. The model must possess professional knowledge in various fields and understand user needs. In the GDPval-AA evaluation measuring this capability, M2.7 achieved an ELO score of 1500 among 45 models, second only to Opus 4.6, Sonnet 4.6, and GPT-5.4, and surpassing GPT-5.3. For the most common office file processing, we systematically optimized the model's ability to handle Word, Excel, and PPT. On various Agent Harnesses, M2.7 can not only generate files directly based on templates and skills but also follow user interaction instructions to perform multi-round, high-fidelity editing of existing files, ultimately delivering editable products.
Interaction capability with complex environments. Generalized daily scenarios mean the model must flexibly adapt to various contexts, call various skills and tools, and maintain stable instruction adherence during long-range interactions. M2.7 has significantly improved in these areas. On Toolathon, M2.7 achieved an accuracy of 46.3%, reaching the top tier globally. Real-world work scenario Agent Harnesses often require understanding and calling numerous complex skills. In the MM-Claw test, M2.7 maintained a 97% skill adherence rate across 40 complex skill cases (each exceeding 2000 tokens).
We tested professional proficiency in the Finance sector, and compared to the previous generation model, the capability improvement is significant. For example, in a scenario involving reading research reports and modeling a company's future revenue, M2.7 can autonomously read annual reports and earnings call transcripts, cross-reference multiple research reports, independently design assumptions and build revenue prediction models, and then output PPTs and research reports based on templates—understanding, judging, and outputting like a junior analyst, and self-correcting through multi-round interactions. Practitioners judged the output ready to enter the subsequent workflow as a first draft.
Below is an example regarding TSMC.
Task: Based on TSMC's annual report and earnings call information, build a revenue model for TSMC, read multiple research reports, design corresponding assumptions, model TSMC's revenue based on the latest information, then produce a PPT based on a PPT template, and write a Word document research report and Excel charts.
① PPT Report Demonstration
② Word Research Report
③ Excel Pivot Report
Recently, the Agent community has flourished, exemplified by the surge in popularity of OpenClaw. We are delighted to see the M2 series models contributing to this prosperity. Based on common tasks in OpenClaw, we constructed an evaluation set called MM-Claw, covering diverse real-world needs from personal learning planning to office document processing and delivery, scheduled professional information research and investment advice, and code development and maintenance. In this test, M2.7 reached a level close to Sonnet 4.6, with an accuracy of 62.7%.
04
Interactive Entertainment
During the use of Agent scaffolds like OpenClaw, many users, while using Agents to complete work, also hope the model possesses high emotional intelligence and the ability to maintain complex personas. With personas, users no longer just let the model mechanically complete tasks but begin to naturally "coexist" with the Agent. This prompted us to think that product and interaction design, content creation, and even the construction of entertainment experiences could all be driven natively by AI. We believe this will expand the use of Agentic models from pure productivity further into interactive entertainment. To this end, we have greatly strengthened persona retention and dialogue capabilities in M2.7.
Based on this, we built an Agent interaction system called OpenRoom, which embeds AI interaction into a Web GUI space where everything is interactive. Here, dialogue drives the experience, generating real-time visual feedback and scene interactions, and characters can proactively interact with the environment. We believe this framework has high scalability and can continue to evolve with the improvement of model Agentic capabilities and community co-construction, exploring more new ways of interaction between humans and Agents. To promote innovation in this field, we have open-sourced this prototype project (most of the code inside was also written by AI):
Project Address: github.com/MiniMax-AI/OpenRoom
Try it now: openroom.ai
MiniMax M2.7 is now fully available on MiniMax Agent and the Open Platform. We look forward to users and developers exploring more interesting scenarios on MiniMax M2.7.
MiniMax Agent: agent.minimaxi.com
API Service: platform.minimaxi.com
Coding Plan Subscription: platform.minimaxi.com/subscribe/coding-plan
Intelligence with Everyone.