ChatGPT's Math Evolution! OpenAI Researchers Reveal: From Miscounting to Solving Erdős Problems with Novel Methods; Math as a Key Benchmark for Model Progress; The AI Automated Researcher

Edited by Yu Cheng

How did ChatGPT's math skills evolve from "can't count" to assisting a Fields Medalist?

Today, OpenAI's official podcast released an episode in which researchers Sebastian Bubeck and Ernest Ryu take up exactly this question, one that many people are intensely curious about.

Ernest recently joined OpenAI as a researcher. He was previously a professor in the Mathematics Department at the University of California, Los Angeles (UCLA), focusing on optimization and machine learning theory. He was among the first to try using ChatGPT to solve open problems in mathematics.

In 2025, with the help of ChatGPT, he solved the Nesterov accelerated gradient method problem that had stumped the math community for 42 years. In the discussion, he mentioned that he had previously spent over 40 hours on it and failed, but collaborating with GPT took only 12 hours to crack this challenge.

The other guest, Sebastian Bubeck, previously served as a professor at Princeton University and worked at Microsoft, accumulating nearly 20 years of math research experience. Since joining OpenAI, he has been dedicated to understanding how AI can assist in mathematical research and evaluating AI's progress in solving difficult math problems.

At the start of the podcast, both guests agreed that the news of "ChatGPT achieving a gold medal performance at the International Mathematical Olympiad in the summer of 2025" sent shockwaves through the math community, especially given that ChatGPT had launched only in late 2022. Furthermore, Ernest pointed out that in early 2025, ChatGPT still struggled to calculate shared camping expenses for three people or determine suitable meeting times across different time zones.

Why has ChatGPT's mathematical progress been so rapid? Sebastian's response was that OpenAI did extensive research and innovation beyond simply scaling models, and that the progress came from many factors converging. The answer stayed fairly general and offered few specifics.

However, Sebastian emphasized that "math has been a perfect benchmark for observing model progress over the last four years." Today, "knowing math" is still the goal pursued by reasoning models because solving math problems requires long-duration thinking while maintaining logical consistency throughout.

Another development that garnered widespread attention in the math community is ChatGPT's work on open problems left by mathematician Paul Erdős. By scanning thousands of papers, the model connected entirely different branches of mathematics and surfaced solutions to 10 Erdős problems that already existed, unrecognized, in the literature. Initially, many mathematicians didn't believe it was real, but the models have since gone further, providing over 10 entirely new solutions not found in existing literature. Sebastian stated these solutions "could be published in top-tier combinatorics journals."

In their conversation, they also discussed the vision of an "automated researcher"—a model or cluster of models that can work autonomously for extended periods. Sebastian explained that AI's thinking time has already spanned from seconds to minutes, to hours, to days. We are now roughly at the 'day' to 'week' stage, and the future aims for 'weeks' or even 'months'.

What will be the role of humans when models become "automated researchers"?

Sebastian's answer: We solve problems because we are trying to understand deeper things, which allows us to better control our environment. We must maintain control and guide AI on "which problems are important." For example, AI does not care about curing diseases; it doesn't suffer from them as we do, but we care. If humans keep that guiding role, he expects a very bright future.

Looking ahead, the two researchers believe that mathematics in the AI era will feature more solutions, more fun, greater theoretical interconnectedness, and faster and more reliable validation of conclusions, and that deep understanding will be more valuable than ever. They also warn that people lacking professional backgrounds who use AI often produce "hallucinated proofs" that seem plausible but are actually absurd.

Simultaneously, they anticipate that AI's mathematical abilities will extend to all scientific domains, enabling scientists to be more efficient and powerful, achieving better results.

Below is the full transcript of the podcast. Enjoy:

LLMs' Progress in Math Has Shocked Professional Mathematicians

Andrew Mayne: I think many people have this perception that these models aren't good at math; after all, they are called "language models." How did this change? What happened?

Sebastian Bubeck: Yes, I think the progress over the past few years has been nothing short of miraculous. It's important to remember that two years ago, we didn't even have reasoning models, let alone models that could prove difficult mathematical theorems. Fast forward two years, and these models are now capable of assisting Fields Medalists in their daily work. This leap is truly staggering. If I may add one more point, it's crucial that everyone, including ourselves, was surprised by this progress.

Let me tell you a story. About a year and a half ago, I was at a workshop with fellow mathematicians, and I participated in a debate on whether scaling LLMs could help us solve major open problems. The room was deeply divided. In fact, they took a poll at the beginning, and I think about 80% said, "No, this can't happen." Then the debate unfolded. By the end, the split was about 50/50. Pretty good progress in one hour. In hindsight, that skepticism was clearly very wrong. Just 8 months later, models started doing research-level math.

Andrew Mayne: What was a breakthrough moment for you, realizing that AI and math had this incredible intersection?

Ernest Ryu: In the summer of 2025, the big news was that ChatGPT achieved top human-level performance at the International Mathematical Olympiad (IMO), earning a gold medal. That was stunning. It demonstrated that, at least for competition-level math, the model's logical capabilities were very strong, comparable to the best human high school competitors. But competition problems are "routine." Their solutions are relatively short, designed to be solved in a few hours, and they aren't novel because if they're posed, they have a solution. So that wasn't research-level math yet. I got curious, and many others were too: Can ChatGPT do research-level math? There was a lot of discussion online. So I thought, I should try it on my own problems. Instead of listening to others, I'd try and judge for myself, as I am a mathematician.

Ernest Ryu: So I picked a classic open problem in optimization theory, the applied math branch I work in. The specific issue is about a famous algorithm called the Nesterov accelerated gradient method. The question is: Does it always exhibit this convergent behavior, or could there be some divergence under extreme conditions? This is a genuine open problem because while we know the algorithm performs well and converges in most cases, we didn't truly know whether counterexamples exist. In the worst case, could it diverge? It turned out that it cannot: the method always converges.
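For readers unfamiliar with the algorithm, the following is a minimal sketch of one common textbook form of Nesterov's accelerated gradient method, run on a toy quadratic. It only shows what the iteration looks like; it is not the proof discussed in the podcast. The function names, the toy objective, the starting point, and the iteration count are all illustrative assumptions.

```python
import math


def nesterov_agd(grad, x0, lipschitz, num_iters):
    """One textbook form of Nesterov's accelerated gradient method.

    grad      -- function returning the gradient as a list of floats
    x0        -- starting point (list of floats)
    lipschitz -- Lipschitz constant L of the gradient; the step size is 1/L
    """
    x = list(x0)   # main iterate x_k
    y = list(x0)   # extrapolated point y_k
    t = 1.0        # momentum sequence t_k
    step = 1.0 / lipschitz
    for _ in range(num_iters):
        g = grad(y)
        # Gradient step taken from the extrapolated point.
        x_next = [yi - step * gi for yi, gi in zip(y, g)]
        # Update the momentum sequence.
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        momentum = (t - 1.0) / t_next
        # Extrapolate along the direction of the last move.
        y = [xn + momentum * (xn - xo) for xn, xo in zip(x_next, x)]
        x, t = x_next, t_next
    return x


# Toy objective (an assumption for illustration): f(v) = 0.5 * (v1^2 + 10 * v2^2)
def toy_f(v):
    return 0.5 * (v[0] ** 2 + 10.0 * v[1] ** 2)


def toy_grad(v):
    return [v[0], 10.0 * v[1]]


x_final = nesterov_agd(toy_grad, x0=[5.0, -3.0], lipschitz=10.0, num_iters=500)
print(x_final, toy_f(x_final))
```

On this toy problem the objective value shrinks toward its minimum of zero. The open question Ernest describes concerns the iterates themselves across all problems of this kind, which a single numerical run cannot settle.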

I remember very clearly how I discovered it. My son's bedtime is 8 pm, and I try not to stay up past midnight. So if I want to focus on something, I usually have a four-hour window of personal time in the evening. I decided, okay, I'll spend a few days on this. Over a span of three days, totaling 12 hours, I interacted with ChatGPT on this problem. It wasn't a matter of entering a prompt and getting the answer. I played the role of a validator. Whenever the model made a mistake, I corrected it. I also tried to steer the conversation towards methodological areas I felt were novel. After a while, a proof emerged, and I checked it. I even had ChatGPT double-check, and it was correct. Just like that, a 42-year-old open problem was solved. After getting the solution, I wondered, what's the most interesting way to release this? I could write a paper, but that's rather dull. So I decided to go talk about it on Twitter (now X). I had a lot of fun. I think this was one of the earliest instances of AI solving a real, open math problem, and it received a lot of attention. It was truly enjoyable.

In Early 2025 and Before, ChatGPT Still Underperformed on Ordinary Math Problems

Andrew Mayne: It's interesting you mentioned that; sometimes we see people say, "Hey, I found something cool or novel," and sometimes it gets debunked, other times it holds up. Social media can be a bit scary, but it seems we really need this feedback loop. I think for many of us, the challenge is hearing words like "IMO" and struggling to grasp what that means on the difficulty scale. I understand basic arithmetic. Can you give me an example of how models evolved from barely coping to doing math, using tools, and even implicitly understanding math?

Ernest Ryu: When ChatGPT first came out in early 2023, I started testing it. I was curious how the model performed on ordinary math problems. This included high school level questions and everyday problems with a mathematical nature.

For example, imagine a scenario where three of us go camping. I paid for this, Andrew paid for that. At the end, we want to settle up and split the costs evenly. Can ChatGPT help us calculate? If you bought 17 different things, it's moderately complex. In '23, '24, and even early '25, I remember the model couldn't do it.
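To make the camping example concrete, here is a minimal sketch of the settle-up arithmetic. The names come from the conversation, but the amounts are invented, and the greedy matching of debtors to creditors is just one simple way to produce the transfers.

```python
from fractions import Fraction


def settle_up(paid):
    """paid maps each person to the list of amounts they paid.

    Returns a list of (debtor, creditor, amount) transfers that leave
    everyone having paid an equal share.
    """
    # Exact arithmetic avoids floating-point rounding surprises.
    totals = {
        name: sum((Fraction(str(a)) for a in amounts), Fraction(0))
        for name, amounts in paid.items()
    }
    share = sum(totals.values(), Fraction(0)) / len(totals)
    # Positive balance: the person is owed money. Negative: they owe money.
    balances = {name: total - share for name, total in totals.items()}
    debtors = [[n, -b] for n, b in sorted(balances.items()) if b < 0]
    creditors = [[n, b] for n, b in sorted(balances.items()) if b > 0]

    transfers = []
    i = j = 0
    while i < len(debtors) and j < len(creditors):
        amount = min(debtors[i][1], creditors[j][1])
        transfers.append((debtors[i][0], creditors[j][0], float(amount)))
        debtors[i][1] -= amount
        creditors[j][1] -= amount
        if debtors[i][1] == 0:
            i += 1
        if creditors[j][1] == 0:
            j += 1
    return transfers


# Hypothetical receipts for the three campers.
paid = {"Ernest": [120.00, 30.00], "Andrew": [45.00], "Seb": [15.00]}
transfers = settle_up(paid)
for debtor, creditor, amount in transfers:
    print(f"{debtor} pays {creditor} {amount:.2f}")
```

With these made-up receipts the total is 210, so each share is 70: Andrew pays Ernest 25 and Seb pays Ernest 55.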

Another example: suppose I'm in Korea, Seb is in Paris, Andrew is in California, and we want to schedule a Zoom meeting. When is a good time? Again, in early '25, the model couldn't do it.
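The scheduling example can be sketched the same way. This version assumes fixed winter UTC offsets (Korea +9, Paris +1, California -8) and a waking window of 07:00 to 23:00; both are illustrative assumptions, and a real tool would use a time-zone database such as Python's `zoneinfo` to handle daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed winter UTC offsets, so daylight saving time is ignored.
UTC_OFFSETS = {"Korea": 9, "Paris": 1, "California": -8}


def common_meeting_hours(day_utc, offsets, earliest=7, latest=23):
    """Return (utc_start, {place: local_start}) for every hourly slot whose
    local start hour h satisfies earliest <= h < latest for all participants."""
    slots = []
    for hour in range(24):
        start = day_utc + timedelta(hours=hour)
        local = {
            place: start.astimezone(timezone(timedelta(hours=off)))
            for place, off in offsets.items()
        }
        if all(earliest <= t.hour < latest for t in local.values()):
            slots.append((start, local))
    return slots


day = datetime(2025, 1, 15, tzinfo=timezone.utc)  # hypothetical date
slots = common_meeting_hours(day, UTC_OFFSETS)
for start, local in slots:
    times = ", ".join(f"{place} {t:%a %H:%M}" for place, t in local.items())
    print(f"{start:%a %H:%M} UTC -> {times}")
```

Under these assumptions only one slot works: 06:00 UTC, which is 15:00 in Korea, 07:00 in Paris, and 22:00 the previous evening in California. That narrowness is part of why the problem is easy to get wrong.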

But then, suddenly, things changed. I wasn't at OpenAI at the time, so I don't know exactly what you all did, but the model suddenly started solving IMO problems. Furthermore, it began tackling research-level problems. My assessment now is this: if you are a physicist or chemist who uses complex math (like differential equations, differential geometry, etc.) but isn't inventing new math, ChatGPT can handle all the math you need. The exception is a professional mathematician trying to discover new mathematical theories.

Ernest Ryu: Essentially, any advanced math user in STEM fields can now use ChatGPT for their mathematical problems. You still need to exercise a degree of caution, check if the answer is correct, and run simulations to verify. Models do make mistakes. But now, for any math problem 99% of people want to solve, the model can handle it.

Andrew Mayne: When I worked on the GPT-4 release, I used scheduling as an example. I could input three people's calendars and ask it to find a time slot. But pushing beyond that was hard. Why did this change happen? Ernest just talked about suddenly discovering it got stronger. We know part of it is tool use, like allowing the model to use a calculator. But other changes happened to the model itself.

Sebastian Bubeck: Going back to that debate I mentioned, the argument at the time was whether merely scaling LLMs themselves would allow breakthroughs in math research. That's the wrong framing. We did a ton of research and innovation at OpenAI, not just scaling models. When you ask what happened mid-2024 that suddenly enabled models to solve math problems, it's actually the result of many factors converging. We did a lot of research, and all of it had to advance simultaneously. So I can't attribute it to a single factor.

Andrew Mayne: But it did achieve it without tool use.

Sebastian Bubeck: Yes. I think it's important to reiterate what Ernest said about the timeline and the scheduling problems models previously couldn't handle. I said we didn't have reasoning models two years ago; think back four years. Four years ago was pre-ChatGPT. I recall Google releasing a math model called Minerva. I was so astonished I fell out of my chair. Why was I blown away? Because simply by giving the model coordinates of points on a plane, it could draw a straight line passing through them. Talking about this now, people might struggle to understand: "What are you talking about? Models can do that easily." So I think we somewhat forget how quickly things have happened. And now, as Ernest said, unless you want to invent new mathematics, the models have basically reached the needed level. I would even say we've seen glimmers of models inventing new math already.

LLMs Doing Math is a Crucial Benchmark for Measuring Model Progress

Andrew Mayne: Can you break that down? Beyond people interested in developing new math fields or proving new theorems, what impact does this have on other things? What are the implications for science? For the other work you're researching? Why is this so important, not just, "Oh, cool, it can solve problems"?

Sebastian Bubeck: I think the "it can do math" part is very important as a benchmark for measuring progress during model development. The beauty of math is that problems are very clear and unambiguous. Everyone agrees on what's being asked. That's point one. Point two, you can verify the answer. Once the model gives an answer, everyone agrees: it's either right or wrong. Evaluation is easy below the research level, though not so simple at that level. So, math has been a perfect benchmark for observing model progress over the last four years. Now, we could say this aspect is saturating. You can ask, okay, models can now do math, what's next?

For the next step, I'd say making models good at math benefits many other things. Let me explain why. A key characteristic of math is that to solve a problem, you must think for a very long time—maybe days, weeks, or even years. This prolonged thinking requires not just duration but also maintaining logical coherence throughout the process. If one error occurs somewhere in the reasoning chain, the entire argument collapses. Even if everything after that is correct, it's useless. A single failure point breaks the whole proof. This property makes it a target for reasoning models: if they make a mistake, can they self-correct? So we hope this capability gained through math can generalize to other domains. And by the way, this exactly parallels humans. Why do we train humans in math? It's fun, I love math, and we do it for a living. But the reason for training humans in math is identical: it endows you with this very rigorous logical thinking ability.

Andrew Mayne: Do we need new ways to think about and discuss these discoveries?

Ernest Ryu: Yes. I personally see part of my role as trying to educate the research community about recent progress because I have a dual background: as a former mathematician and now working at the frontier of AI. Indeed, Twitter and social media are good places for explaining advances, especially given the rapid pace of progress.

ChatGPT Solves Erdős Problems in Mathematics

Andrew Mayne: For instance, we could talk about the Erdős problems and some of the controversy surrounding them. First, the example Ernest gave, and then there were other problems solved. Could you briefly introduce who Paul Erdős was? I think people would want to know why he's so special and why his problems are interesting.

Sebastian Bubeck: Of course. Paul Erdős was one of the most prolific mathematicians of the last century. I think he wrote 1,500 research papers. He was a very eccentric character. He didn't have a house or an apartment. He simply traveled from one university to another, looking for new collaborators. Wherever he went, he essentially posed problems. He was extremely gifted at asking questions. Not all of his questions were interesting, but they were highly inspirational. The research community co-authored many papers with him. There's even a concept of the "Erdős number," which is the distance in the collaboration chain from yourself to Erdős. My Erdős number is 2. I co-authored a paper with someone who co-authored with Erdős.

Andrew Mayne: Wow, that's impressive.

Ernest Ryu: Mine is 3.

Sebastian Bubeck: There's a joke that you could just take a train ride with him, and by the time you get off, you might have co-authored and signed a paper with him.

Ernest Ryu: Exactly. I think the "2 vs. 3" basically reflects our age difference, that's the truth.
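The Erdős number described above is a shortest-path distance in the co-authorship graph, so it can be computed with a breadth-first search. The sketch below uses a made-up graph with placeholder names; it does not reflect anyone's real collaborations.

```python
from collections import deque


def erdos_numbers(coauthor_pairs, source="Erdős"):
    """Breadth-first search over an undirected co-authorship graph.

    coauthor_pairs -- iterable of (author_a, author_b) pairs that share a paper
    Returns a dict mapping each reachable author to their distance from source.
    """
    graph = {}
    for a, b in coauthor_pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for other in graph.get(person, ()):
            if other not in distance:
                distance[other] = distance[person] + 1
                queue.append(other)
    return distance


# Hypothetical collaboration chain with placeholder names.
pairs = [
    ("Erdős", "Author A"),
    ("Author A", "Author B"),
    ("Author B", "Author C"),
    ("Author A", "Author D"),
]
numbers = erdos_numbers(pairs)
print(numbers)
```

In this toy graph, Author B has Erdős number 2 and Author C has Erdős number 3, the same pattern as the two guests' numbers.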

Sebastian Bubeck: Anyway, Erdős left behind all these problems. A mathematician named Thomas Bloom built a fantastic website tracking all the unsolved Erdős problems. There are about a thousand problems on that site. Thomas himself is a combinatorics expert. He can mark: this is open, that is solved. Of course, he might not know the answers to all of them. If a problem is marked "open," it doesn't necessarily mean no one knows the answer, but it serves as an interactive platform where people can comment and explain solutions. When we started having GPT solve research math problems, this seemed like a treasure trove for trying out the model. We tried a few. To our great surprise, the model provided answers to some problems marked as "open." We were extremely excited by this.

I tweeted about this around October last year; it was a "deep literature search" result. Let me explain what that means. It means GPT performed an extremely extensive literature search, scanning thousands of papers. It found the answer to the problem in some unrelated area. It's important to understand this: it wasn't that in that unrelated area, someone had written "I'm solving an Erdős problem." It was written in a completely different language, belonging to a distinct branch of mathematics. You had to do the work of connecting these two parts, and GPT did it. That's amazing. This was still somewhat ad hoc at the time; we were just trying things manually in the ChatGPT interface. After seeing these, our team member Mark Sellke decided to take a more systematic approach, trying all the problems, and the model provided solutions to 10 Erdős problems. You have to remember, there was still a fierce debate then about whether models could go beyond the state of the art to discover or invent new math.

I was very excited about this result and tweeted it. That tweet became somewhat "infamous" because people misunderstood what I meant, thinking the model had conjured entirely novel solutions, never before seen in the literature, for 10 very difficult open problems. But that wasn't the case. As with the earlier example, these were "deep literature search" results. There was even some debate with Demis Hassabis of Google DeepMind about how to describe such results. What is astonishing is where things stand just a few months later. Back then I was talking about solutions to 10 open problems where the solutions already existed in the literature. So the question became: Can you find solutions not present in the literature? So far, we have over 10 truly novel solutions, fully publishable in top combinatorics journals. These solutions were derived entirely by ChatGPT or our internal models. This again speaks to the acceleration: in just a few months, we went from "claiming 10 solutions to Erdős problems sounds absurd" to "this is actually happening and accelerating."

AI Will Reshape Our Understanding of the Nature of Scientific Progress

Andrew Mayne: That's fascinating, because the first step seemed to be enabling models to do excellent literature research. Many major papers and awards have previously gone to people who discovered through literature searches that "this problem here was actually solved elsewhere." So achieving that first step is cool, but now it's genuinely doing original research. What I really love about AI research is that it forces us to confront big questions about intelligence, research, progress, and how we discover new things. Specifically, is the progress we see in science just assembling different pieces together with a bit of reasoning, or are there truly those flashes of genius insight?

Sebastian Bubeck: Of course, everyone points to Einstein's theory of relativity, but honestly, I'm not sure that counts. So, whether this process of merely "recombining" plus "a little thought" can indefinitely expand human knowledge, or whether we truly need some kind of "spark of genius" that only humans possess to some degree, is still an open question.

Andrew Mayne: Even Einstein himself credited someone, I forget who, who proposed the analogy and visualization method. He said he didn't invent it, pointed to the person who did, and said he had just pushed it a step further. I think we sometimes get too enamored with these simple little stories, and reality is often much more complex than that.

Sebastian Bubeck: Yes, exactly right.

Andrew Mayne: If we get better mathematical tools in AI, what does that mean for the wider community of scientists? How does it impact fields like biology, materials science, and others?

Sebastian Bubeck: When it comes to the impact on other scientific fields, it's very important for everyone to understand one thing: we aren't doing something uniquely tailored for mathematics. Our techniques and training methods are very general; they apply to everything. So while we're seeing more progress in math, partly because it's so easy to benchmark and track improvements, we fully expect this to happen across all scientific domains, not just math.

Building AI "Automated Researchers": Making Models Work Over Longer Timescales

Andrew Mayne: AI seems very good at this kind of reasoning: "If this is true, then that is true," completing long chains of such statements, which has many applications elsewhere. We've heard the term "automated researcher." Do you want to elaborate on that?

Sebastian Bubeck: The way we currently work is exactly what Ernest described—it's an "interaction." It's somewhat like a professor-student interaction where ChatGPT is the student. The professor gives an initial problem, the student comes back with feedback, they chat a bit, the student goes back to work for a week, and then returns. Of course, the key point is that this model drastically compresses the timeline. In the case of Ernest solving that problem, it took 12 hours. I wonder, without ChatGPT, how long would you have spent?

Ernest Ryu: Without AI, I had already spent over 40 hours and failed. I don't know, maybe a month if on my own.

Sebastian Bubeck: Exactly. So there's this timeline compression effect. When we talk about an "automated researcher," it's a slightly different vision: a model or cluster of models that can work autonomously for very long periods. If we want to surpass the current level, this is necessary. That professor-student interaction model, where the "student" returns a week later, makes achieving true breakthroughs very difficult. It's hard to solve those longstanding research puzzles or make progress in extremely tough fields like biology, which requires interacting with wet labs and conducting various experiments. To pursue real breakthroughs, we need models to work on longer timescales. That's where the automated researcher comes in.

Or, to phrase it differently, a concept I strongly advocate is "AGI Time". You can have AGI seconds, minutes, hours, days, etc. It means having an AI that can simulate human thought, but for how long? As Ernest said, two years ago, models might simulate a high school student thinking for a few minutes. Now we can simulate a researcher thinking for hours or even days. We really want to move in this direction—and this progress has been very consistent over the past four years. We've literally made the leap from seconds to minutes, to hours, to days. We are now roughly at the 'day' to 'week' stage. We hope to move towards 'weeks' or even 'months.' This is open research; I think nobody on Earth knows exactly how to do it. But this goes back to the point: we are conducting a huge amount of research and innovation. I think when everything comes together, we will see this continuing arc of progress on 'AGI Time.' That's the direction of the automated researcher.

Ernest Ryu: Other mathematicians I've spoken to use AI by opening ChatGPT and conversing within that context window. You can have multiple sessions, but each session has a finite context length, roughly the size of 50 pages of mathematical papers. This isn't long enough for truly deep mathematical breakthroughs, as many math papers exceed 50 pages. Moreover, the amount of human thought invested in producing a 10-page or 30-page paper is typically orders of magnitude greater than the final output.

So the finite context window is a limitation. But anyone who has used Codex knows you can actually have very long working sessions with Codex. You just keep giving instructions about the kind of code you want to write. And the code you're working on—your codebase (analogous, in a math context, to your written math notes)—can become very, very large. Codex is very good at handling this. It occasionally compresses conversation history; it has ways to be a truly astonishing agent, completing extremely complex tasks across vast codebases and extremely long conversational contexts.

I believe the same will happen for math research. We'll be able to have LLMs solve problems whose thought processes exceed 50 pages. This is what human mathematicians do: people think about a problem for a day, then summarize ideas and write them in notes; come back the next day or next week. Over months, we think a lot, but it's summarized and organized into manageable patterns. Finally, the end product is a 30-page paper condensing months or even years of thought.

AI Empowering Scientific Research

Andrew Mayne: Yes, I think that will happen. This weekend I was working on a problem you'd find laughable, trying to use an LLM to figure out how to make a small-scale LLM do math. Midway, I needed a benchmark, so I found Easy Math (a benchmark for small LLMs), but it was just a paper with not much data. Right in the middle of working with Codex, I said, "Can you create the benchmark here and generate the data for me?" Five minutes later, I had it. That was magical to me, because previously I'd have to spend hours writing generators and such.

Sebastian Bubeck: Absolutely, and it was running in the background. I can't imagine what it's like when you all handle "adult-level" problems. What you described is exactly what we aimed for when publishing that paper titled "Early science acceleration experiments with GPT-5." You experienced literal "acceleration." What might have taken you days of work...

Andrew Mayne: Or I would have just given up.

Sebastian Bubeck: Yes, exactly the point. It empowers scientists everywhere, like enabling mathematicians to use code. Many of our friends don't write code, and now suddenly they have Codex. They can personally conduct all the experiments they previously had to assign to a poor graduate student. Now they can do it very easily. Conversely, thanks to ChatGPT, scientists across all disciplines can now use more advanced mathematics.

Humans Must Control and Guide AI to Solve Important Problems

Andrew Mayne: I once sat with Bob Metcalfe, teaching him how to use Codex to write R, because he was working on a project and R was completely new to him. It was a very interesting experience, taking a great mind and telling him, "Hey, you don't need to spend vast amounts of time digging through details. This is your tool." But of course, as you mentioned earlier, we should talk about the human role in all this. Especially when we start thinking about the future. I'm not obsessed with predicting the future; I prefer explaining what has happened... But what do you think will happen?

Sebastian Bubeck: I think there's my gut intuition and the rational consideration. The rational thought is: Look, progress over the past four years has been remarkably consistent. From solving math problems in seconds, to minutes, to hours, to days. There's absolutely no reason to think this trend will stop. Anyone observing this would say: In a year, you'll have systems that can think for weeks; in two years, systems that can think for years. And not only that, today we already find areas where our models can surpass humans, like error detection in papers. We have internal agents that can find papers and point out, "Hey, this is actually wrong; the correct answer is here."

Furthermore, people tend to think AI is only good at answering questions. That's not true; it's also very good at asking them. Of course, this needed some research innovation, and we've done that. Our models are now very good at asking questions, to the point where humans see these questions and think, "Hey, maybe I should write a paper on that." This is already happening. So I'd say, in a year or two, models can do essentially all the things human researchers do. What's next? What is the human role? Why do we do science? What's the point? The point shouldn't be solving problems for the sake of solving them. We solve problems because we are trying to "understand" something.

Understanding is key. We don't solve problems just to publish papers or to prove we can write 10 times more than our neighbor. That's not the point. If you just love solving problems, you can take up competitive chess. We are trying to understand deeper things. Why understand deeper things? Because we want to better control our environment. We want to cure diseases, build things better, faster, stronger, and more reliable. So I think, as long as humans keep control and guide 'which problems are important,' we will have a very bright future. AI doesn't care about curing diseases; it doesn't suffer from them like we do. But we care, so we must control and guide it to solve these problems.

Math in the AI Era Will Be More Fun, More Interconnected, More Reliable, and Faster

Andrew Mayne: When the first computer was invented, when a "computer" shifted from being a human role to an actual machine, some thought we'd all have to pivot from math to physics because physics would have the hard problems while computers would solve all math problems. That was the 1940s and 50s, and it turned out not to be true. Computation opened up entirely new branches. This trend will continue; today's high school mathematician will have a very exciting future 30 years from now, precisely because of everything happening now.

Ernest Ryu: I think math will become incredibly interesting. Before the AI era, we'd spend months solving a problem. It's enjoyable, but the process is extremely arduous. It's painful, truly painful. And when you actually find the solution, there's a dopamine rush. That experience will be accelerated: more solutions, more fun.

Furthermore, I think math will become richer because its interconnectedness will increase. At the research level, much math is very "niche." When you write a paper, you know only perhaps 5 people alive will care about it. But you like the result, so you publish it. 20 years later, it sits in an archive, unread. But now with AI, AI will have read it. If there's some useful connection, as Sebastian mentioned, AI will bring it to the surface. People 100 years from now will discover and use it. So I'm now more confident that if my published research has future utility, it will definitely be used. Also, I can engage with math more broadly. There are fields I haven't studied, but if a relevant result appears, previously I'd have to study that field to use it, and without AI's help, I wouldn't even find that result. Now it's accessible. The model tells me, "Hey, you can use this to solve your problem," and I try it. So math will become a much more interconnected enterprise.

Ernest Ryu: Also, verifying mathematical correctness is actually very complex. Imagine a 300-page proof claiming to solve a very important problem. The author is reputable, and the paper looks plausible on the surface. How do you confirm it's right? This verification process often takes years. It's not enough for one person to read it; many people need to read, try to extend, and delve into the details. This process is very slow. Sometimes, proofs with fatal errors even get published. This leads to an entire field initially accepting a result, only to later discover it's irreparable and must be discarded. With AI, this will be drastically accelerated. Currently, ChatGPT and our AI models aren't perfect at verifying math, but they are already very good. And they are more patient than humans.

Sebastian Bubeck: Indeed. The reality is that much published mathematical research has small errors, and many even have large errors. We know this because we've tested it with models. But I think a richer future for math will come through AI verification. We will achieve greater certainty about which results are correct and which are wrong, and receive faster feedback. A paper published a week ago can be verified immediately. We can confidently build upon it without waiting five years to confirm its correctness. In summary, math will be more fun, more interconnected, more reliable, and faster. Mathematicians will solve harder, more interesting problems.

Preventing Shallow Over-Reliance on AI; Deep Understanding is More Valuable Than Ever

Sebastian Bubeck: I completely agree. But I also want to discuss a potential danger of current progress: we might hand the "keys to the castle" to AI. Humans might start over-trusting the system and stop training hard to master skills. Where we once patiently sat for hours, days, or even weeks to understand a result, now we might just ask ChatGPT to explain it in simple terms. I worry that over-reliance on tools could lead to superficial understanding. So I think it's very important for every listener to understand this: expertise is more valuable than ever. We've been able to extract these results from ChatGPT because of our years of training and deep understanding of the subject. Without that, we couldn't push the frontiers of technology. We've already seen the evidence: it's not as if thousands of non-mathematicians suddenly started proving new results. In fact, we see counter-examples on social media, where non-mathematicians try to prove theorems using these tools and produce dozens of pages of proof, all of which are wrong. This is a danger we must confront.

Andrew Mayne: That seems likely to become a problem in many areas. People often use current models just to reinforce what they want to hear. Like, "I'm going to propose some kind of unified field theory," that sort of thing. Guess what? That's going to be much harder.

Ernest Ryu: This "mental atrophy" issue is also very prominent in programming. I'm not a computer science major, but I took classes and wrote code myself. I battled with debuggers; most people my age went through that. But now, in university courses, you don't even need to experience that anymore. I think that's very dangerous.

AI Will Help the Younger Generation Reach Scientific Frontiers Faster

Andrew Mayne: I hear some people in the scientific community being very optimistic about progress, even saying, "We will no longer need scientists."

Sebastian Bubeck: No, absolutely not. Wow, that's a terrifying statement. I truly hope no one listening says that. That is the exact opposite of what we need. We need scientists more than ever. These scientists will be more efficient, more powerful, and produce better achievements. But we need them to be extremely, extremely proficient in their craft. Obviously, OpenAI can't do everything; existing institutions (academia) have a very important role. Academia needs both to understand the speed of progress and to rediscover its role within this process.

Andrew Mayne: My hope and expectation is that we'll see more people entering the sciences. If you decide to join later in life, as long as you focus, catching up will be easier because you have the world's greatest tutor. OpenAI added visual explanation tools to ChatGPT. Just because an AI model tops a benchmark doesn't mean the job is done. It's like saying, "We solved elementary school math, congratulations everyone, AI is complete." No, there's the next level, and the next, and these require humans.

Sebastian Bubeck: Yes, it will help the younger generation reach the frontiers of science faster. If I had ChatGPT as a teenager, it's unimaginable. I remember looking at Maxwell's equations and thinking, "What does this even mean? How did they come up with this?" Now you can just ask it directly, and it will explain beautifully. This is significant, but you still need to put in the hard work on top of that foundation.

Andrew Mayne: In codebases and similar places, we see people submit fixes that aren't real fixes. How do you solve this problem? If I were a mathematician or a journal editor right now, I'd be a bit scared.

Sebastian Bubeck: Yes, I think as Ernest said, AI can also help here. We can deploy AI agents at the other end of the system to inspect all content and verify as much as possible. Of course, we don't want to fully trust AI to validate and decide whether to accept a paper or a review comment, but we can have AI agents flag specific potential issues. For instance, it might alert: "Hey, I'm not quite sure about this part." This accelerates the process, essentially helping humans reduce the amount of work they need to personally verify.

Ernest Ryu: Furthermore, I think the social structure of math or code needs some adjustments, meaning the person submitting the code or controlling the agent must bear responsibility. In mathematics, there's already a culture: if you publish a flawed proof, it damages your reputation. When you publish an article with your name attached, you're staking your reputation on it. I think we need more constraints like this.

Using ChatGPT to Learn Math: Ask Based on Your Blind Spots, Have It Pose Questions

Andrew Mayne: If a viewer or listener is curious about math, perhaps they're interested but don't see themselves as a "math genius," yet want to try starting out, what would you say to them?

Ernest Ryu: Go chat with ChatGPT. If you're interested in learning, it will be extremely helpful. Even at the research level, when I need to learn a new concept, my habit used to be to check Wikipedia, but the content there is very abstruse. After about 30 seconds, I think: Okay, let me ask ChatGPT. I'll pose questions and follow up. Doing this, it provides very practical information tailored precisely to the gaps in my knowledge, because I'm asking based on my own blind spots.

You can tell ChatGPT about your mathematical background, the books you've read, and the material you've learned, and then ask it to propose a problem that is both open-ended and understandable at your level of expertise. Sebastian mentioned this point; I feel people haven't yet realized that these large models can ask very good questions, but I believe they can. So you have a companion to discuss math and problems with. You can ask the model to help you solve a problem; once you have the answer, you can continue the conversation and pose the next question or a related variant. This makes the process much richer. Even though you are alone in a room, it no longer feels like a solitary process. And this is the true joy of math, because mathematics is fundamentally a social endeavor.

Andrew Mayne: I think fun brainteasers would also be great. I tell people you can start with a seemingly silly question like, "How many M&Ms can fit in a bathtub?" You start asking, and then you might ask: How many words did you read last year? How would you calculate that? And you can begin a wonderful conversation. Before you know it, you're engaging with increasingly complex math and realizing its impact on you. Gentlemen, this was fantastic. Sebastian, Ernest, thank you very much.

Sebastian Bubeck: Thank you.

Ernest Ryu: Thanks for having us.

Reference Link:

https://www.youtube.com/watch?v=9-TVwv6wtGQ
