Problem 21.10 from the Kourovka Notebook, a group theory problem that had remained unsolved for decades, was recently cracked by Oxford University mathematician Marc Lackenby with the assistance of AI Co-Mathematician, a new Agent system from Google DeepMind.
In traditional mathematical research, teams have to repeatedly confirm problem boundaries, judge which literature is truly relevant, and run small-scale computational experiments to build intuition. AI's progress in mathematics has so far shown up mostly as isolated capabilities, such as stronger reasoning, more mature formal proofs, and more convenient tool invocation. These capabilities have not yet been integrated into a research workflow that can be carried forward over time.
AI Co-Mathematician attempts to solve exactly this problem. Its goal is no longer to answer a single reasoning step or fill in one section of a proof, but to provide a multi-Agent workbench capable of long-term collaboration. On the group theory problem, it did not produce a finished answer; instead it proposed a highly suggestive line of proof. It was in this draft, gaps and all, that Oxford mathematician Marc Lackenby saw a breakthrough. Through repeated collaboration between him and the Agent, the problem was eventually pushed closer to a full solution.
In their paper, the Google DeepMind research team reports that the new Agent set a new state-of-the-art (SOTA) on the hardest math benchmark, achieving 48% accuracy on FrontierMath Tier 4. This suggests that the Agent not only changes the mode of collaboration but also delivers quantifiable performance improvements.
Paper link: https://arxiv.org/abs/2605.06651
AI Co-Mathematician: A Multi-Agent Workbench for Long-Term Collaboration
AI Co-Mathematician is a multi-Agent system specifically designed for mathematical research.
According to the paper, the user primarily interacts with a top-level Project Coordinator Agent. This Agent first clarifies the problem's boundaries and confirms the research goals, then distributes tasks across different workflows. Each workflow in turn calls on sub-Agents for literature search, code experimentation, proof attempts, and result review, and writes intermediate results back to a shared file system. The final deliverable is not a single conversation thread whose context is easily lost, but a continuously updated working document that retains marginal notes, source attributions, internal links, and review traces.
Figure: A simplified schematic of the organizational structure of various Agents in a typical AI Co-Mathematician workspace. Arrows indicate standard information flow paths used to gather information from the user and distribute user instructions to each Agent.
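The coordinator-and-workflow structure described above can be illustrated with a small Python sketch. This is not the paper's implementation: the class names `ProjectCoordinator` and `Workflow`, the `SubAgent` type, the file layout, and the stub agents are all assumptions made for illustration. It only shows the general shape of one coordinator routing work to workflows whose sub-agents write intermediate results into a shared directory.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical sub-agent type: takes a task description, returns text notes.
SubAgent = Callable[[str], str]


@dataclass
class Workflow:
    """One line of inquiry, carried out by a set of named sub-agents."""
    name: str
    task: str
    sub_agents: Dict[str, SubAgent]

    def run(self, workspace: Path) -> List[Path]:
        # Each sub-agent writes its intermediate result into the shared
        # workspace so other workflows and the user can read it later.
        out_dir = workspace / self.name
        out_dir.mkdir(parents=True, exist_ok=True)
        written = []
        for role, agent in self.sub_agents.items():
            path = out_dir / f"{role}.md"
            path.write_text(agent(self.task), encoding="utf-8")
            written.append(path)
        return written


@dataclass
class ProjectCoordinator:
    """Top-level agent the user talks to; in this sketch it only routes work."""
    workspace: Path
    workflows: List[Workflow] = field(default_factory=list)

    def assign(self, workflow: Workflow) -> None:
        self.workflows.append(workflow)

    def run_all(self) -> List[Path]:
        files: List[Path] = []
        for wf in self.workflows:
            files.extend(wf.run(self.workspace))
        return files


def demo() -> List[str]:
    # Stub sub-agents standing in for literature search, code experiments,
    # proof attempts, and review; real ones would call a model.
    roles = ["literature", "experiment", "proof", "review"]
    stubs = {r: (lambda task, r=r: f"[{r}] notes on: {task}\n") for r in roles}
    with tempfile.TemporaryDirectory() as tmp:
        coordinator = ProjectCoordinator(Path(tmp))
        coordinator.assign(Workflow("wf1", "toy conjecture", stubs))
        return [p.name for p in coordinator.run_all()]


if __name__ == "__main__":
    print(demo())
```

Writing to files rather than passing results through one conversation is what lets later steps pick up earlier work without relying on chat context.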
The research team emphasizes that the Agents continuously record all failed hypotheses, dead-end approaches, and weaknesses exposed during review, preserving them as formal research context rather than discarding them. They argue that in mathematical research, "what doesn't work" is inherently valuable information. Failed explorations are therefore not negligible noise but a crucial basis for later redefining problems, adjusting strategies, and opening new research paths. Working toward a single research objective, the system can advance multiple workflows in parallel and add new ones as needed. Each workflow continuously reports its progress stage by stage and produces reviewed reports. If a workflow ultimately fails to complete its task, the system issues a clear warning.
Figure: A single workflow consists of a series of actions executed by the Workflow Coordinator Agent, which may trigger updates to the project state and/or the user interface.
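The idea of keeping failed attempts as research context, and warning clearly when a workflow does not finish, might look roughly like the sketch below. The names `DeadEnd` and `ResearchLog` and the report wording are hypothetical and are not taken from the paper.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DeadEnd:
    """A failed hypothesis or approach, kept as research context."""
    workflow: str
    hypothesis: str
    reason: str


class ResearchLog:
    def __init__(self) -> None:
        self.dead_ends: List[DeadEnd] = []

    def record_failure(self, workflow: str, hypothesis: str, reason: str) -> None:
        # Failures are appended and never deleted.
        self.dead_ends.append(DeadEnd(workflow, hypothesis, reason))

    def context_for(self, workflow: str) -> str:
        # Serialise earlier failures so a later attempt can avoid repeating them.
        rows = [asdict(d) for d in self.dead_ends if d.workflow == workflow]
        return json.dumps(rows, indent=2)

    def final_report(self, workflow: str, succeeded: bool) -> str:
        n = sum(d.workflow == workflow for d in self.dead_ends)
        if not succeeded:
            return (f"WARNING: workflow '{workflow}' did not complete its task "
                    f"({n} dead ends recorded)")
        return f"Workflow '{workflow}' completed ({n} dead ends recorded)"


log = ResearchLog()
log.record_failure("wf1", "induction on rank", "base case fails")
print(log.final_report("wf1", succeeded=False))
```

The point of the design is that a failed workflow still leaves behind a readable record, and its final report cannot be mistaken for a success.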
Meanwhile, the research team has also built constraints around "uncertainty management": code that hasn't passed tests cannot be considered complete; a report that hasn't passed review cannot be directly finalized; if a research path is stuck for too long, the Agent must explicitly expose the problem to the user, rather than masking the logical gaps with a formally polished manuscript.
Figure: Once the research problem and objectives are defined, the Project Coordinator assigns various workflows to advance toward the goals.
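The "uncertainty management" constraints described above amount to a few gating rules. The sketch below is one way to express them, under stated assumptions: the `Artifact` type, the action names, and the `max_rounds_stuck` threshold of 3 are all illustrative and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    kind: str                     # "code" or "report"
    tests_passed: bool = False
    review_passed: bool = False


def can_finalize(a: Artifact) -> bool:
    """Code needs passing tests; a report needs a passing review."""
    if a.kind == "code":
        return a.tests_passed
    if a.kind == "report":
        return a.review_passed
    raise ValueError(f"unknown artifact kind: {a.kind}")


def next_action(a: Artifact, rounds_stuck: int, max_rounds_stuck: int = 3) -> str:
    # max_rounds_stuck is an arbitrary illustrative threshold.
    if can_finalize(a):
        return "finalize"
    if rounds_stuck >= max_rounds_stuck:
        # Surface the problem to the user instead of polishing a flawed draft.
        return "escalate_to_user"
    return "keep_working"


print(next_action(Artifact("report"), rounds_stuck=5))
```

In this sketch an unreviewed report can never reach "finalize", and a path that stays stuck ends in an explicit escalation rather than a polished manuscript.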
Setting a SOTA on the Hardest Math AI Benchmark and Engaging in Real Mathematical Research
In benchmark testing, AI Co-Mathematician achieved 48% accuracy on FrontierMath Tier 4, setting a new SOTA score for AI on that benchmark. Specifically, after removing 2 publicly available sample problems, it correctly answered 23 out of 48 private problems.
FrontierMath is a high-difficulty math benchmark developed by Epoch AI, containing 350 original problems spanning multiple branches of modern mathematics. The hardest tier, Tier 4, has only 50 problems. According to the Epoch team, some problems at this level might remain unsolvable by AI for decades, and human experts typically need days to solve a single one.
By contrast, its base model, Gemini 3.1 Pro, scores 19% accuracy on the same test. Moreover, the research team highlights that among the 23 correctly answered problems, 3 had never been solved by any previously evaluated system.
Figure: Accuracy scores for Gemini 3.1 Pro, Gemini 3.1 Deep Think, and AI Co-Mathematician (also based on Gemini 3.1) on an internal research-level math benchmark.
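The headline figure can be checked with simple arithmetic from the numbers quoted above (50 Tier 4 problems, 2 public samples removed, 23 solved, 19% for the base model). The variable names below are only illustrative.

```python
tier4_total = 50       # Tier 4 problems, as stated in the article
public_samples = 2     # public sample problems removed from the evaluation
solved = 23            # problems answered correctly
base_accuracy = 0.19   # reported accuracy of the base model

private = tier4_total - public_samples
accuracy = solved / private

print(private)                              # 48
print(f"{accuracy:.1%}")                    # 47.9%
print(round(accuracy * 100))                # 48
print(f"{accuracy / base_accuracy:.1f}x")   # 2.5x
```

So the reported 48% is 23/48 rounded to the nearest percent, roughly two and a half times the base model's reported score.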
Real-world use cases are equally noteworthy. The research team points out that the following results were all obtained directly by mathematicians, without any intervention from Google DeepMind researchers.
Oxford mathematician Marc Lackenby used the system to advance Problem 21.10 from the Kourovka Notebook; mathematician Semon Rezchikov obtained a proof route containing key lemmas for a sub-problem related to Hamiltonian systems; and mathematician Gergely Bérczi obtained proof attempts and computational evidence on a problem concerning Stirling coefficients. However, in Bérczi's case the related proof is still marked in the paper as "under detailed human review," and the comparison in Rezchikov's case rests mainly on anecdotal experience rather than a controlled experiment. This suggests that the collaborative setup has real-world value, but it does not imply that the Agent can stably and independently complete open-ended mathematical research.
Limitations and Future Directions
The research team also acknowledged the system's shortcomings:
For instance, multiple rounds of review do not necessarily produce more reliable results. Sometimes, a fundamentally flawed argument, after repeated revisions, can increasingly appear as if it has "passed review," while the actual flaw remains unaddressed. Furthermore, different Agents might fail to reach a consensus, leading to endless cycles of revision and rejection, which progressively degrades the quality of reasoning.
At the same time, the Agent system currently cannot stably complete long-range research tasks without sustained human intervention. Prolonged autonomy also means users must cede some control, and when the current model runs into unexpected difficulties, its judgment about when to stop and when to ask for help is still markedly inferior to that of a human researcher. Additionally, a beautifully typeset LaTeX document can very easily create the illusion of "rigorous content."
The research team is also relatively restrained about future directions. They believe the more important next step is not simply pursuing stronger result-generation capabilities, but developing new evaluation frameworks that measure collaboration effectiveness, stateful exploration capacity, and rigorous uncertainty management. At the same time, future researchers will have to face the questions of how to control the semantic noise brought by automated output, how to ease the burden on peer review, and how to preserve human judgment about the overall value of a paper.
AI Co-Mathematician is not so much becoming a "mathematician" capable of independently tackling hard problems as revealing another possibility: AI as a partner that humans can keep collaborating with throughout the long, winding, trial-and-error-filled research process.