Models Are Too Fond of Cheating! Cursor Reveals the Inside Story of Composer 2's Reinforcement Learning: Models Can Detect 'Fake Environments', and Floating-Point Non-Determinism Is a Fatal Flaw in RL Training

Editor | Yu Cheng

"Sometimes the model can actually detect whether it's running in a simulated environment or a real one, which leads it to perform differently during reinforcement learning (RL) than it does in production."

"Models simply love to cheat, and reinforcement learning is exceptionally good at encouraging this kind of 'cheating' behavior."

Not long ago, Cursor released the second-generation update of its proprietary model, Composer 2, with performance rivaling Claude Opus 4.7 while costing only a tenth of the price, creating a buzz in the community.

An application-layer company has turned itself into a genuine frontier model lab, and people are curious about how they pulled it off.

As it happens, Sequoia's official podcast recently invited Federico Cassano, the research lead for Cursor's Composer 2 project, and Dmytro Dzhulgakov from Fireworks, which provided the distributed infrastructure for the project, to discuss the topic of "how Composer 2 was built."

Let's dive right in.

When the new model was released on May 19th, it was officially mentioned that, building upon the open-source Kimi K2.5, 85% of the training compute was spent on mid-training and reinforcement learning. This conversation largely revolves around these mid-training and RL processes.

One thing that first stunned me: Federico mentioned that during the RL process, if the simulated environment differs even slightly from a user's actual computer environment, the model can perceive that it is in a "fake environment."

And once the model realizes this, it will use some "tricks" to "cheat" for higher reward values, rather than truly learning how to solve problems.

Federico's insights on the role of RL for the model were also fascinating. He said that pre-training allows the model to absorb all of human knowledge, while the RL phase is like adjusting a "knob" that makes the model understand, "Hey, you're an expert, you need to get things right."

Moreover, during RL training, the model interacts directly with Cursor's Harness, which allows it to understand "the 'world' it will inhabit for the rest of its life."

Dmytro also shared a fact that may startle outsiders: floating-point arithmetic on computers is effectively non-deterministic, because the result depends on the order of operations, so a+b+c is not necessarily equal to c+b+a.

This significantly impacts RL because Mixture-of-Experts (MoE) models are extremely sensitive to precision. Such computational discrepancies can cause the model to activate entirely different expert nodes, producing completely different outcomes.

Therefore, tiny computational differences can cause RL training to fail entirely. To solve this, engineers need to manually write complex GPU kernels to enforce a consistent order of operations.

On the allocation of inference compute, Federico also had a unique perspective. He believes the idea that you consume far more FLOPs on inference than on training is a myth. In an ideal scenario, roughly 1/3 of GPU compute should be allocated to inference, with the rest used for weight updates.

Additionally, they believe that for specific domains, model "specialization" trumps "generalization."

The traditional belief is that bigger, more general models are better, but after specializing all model weights for the task of "software engineering within Cursor," they found that a smaller model can approach the performance of a large model like Opus, while costing an order of magnitude less.

Furthermore, RL is also run on the Harness. By performing RL optimization on the system comprising a part of the Harness and the model itself, they trained Composer 2 to develop "self-summarization" or "compression" capabilities, allowing the model to effectively process millions of tokens beyond its 200k context window.

He also shared his understanding of the RL environment, which he believes is composed of three parts: the Harness, the operating system, and the reward component.

Of these, the Harness is usually portable, with the operating system being the critical element. So at Cursor, they built their own entire virtual machine stack with extreme burst capacity, allowing them to instantly spin up 100,000 virtual machines for model experimentation on demand.

As for the reward component, when the host asked what kind of reward signal they use in RL, the answer was a no-go: "top secret."

They also declined to answer another question, which referred to Andrej Karpathy's comment that current RL is still extremely inefficient: running a long "rollout" only to receive a tiny bit of information at the very end feels like sipping bits through a straw. Their approach to "squeezing more bits out of that path" remains confidential.

Below is the full content of the podcast. Enjoy.

An Order of Magnitude Cheaper! Model Weights Specialized for Coding Tasks

Host: Great to have both of you here to chat today about how the training for Composer 2 was completed, what difficult problems you solved together, and what you think this means for the future of AI and foundation model companies.

Federico / Dmytro: Sounds exciting. Yes, very exciting. Thanks for having us.

Host: Thanks for joining us. For those who haven't been closely following, Cursor recently released Composer 2, an agentic coding model specifically for long-horizon coding tasks. Federico, before this, Cursor was mainly empowering other people's coding agents. What drove Cursor to invest so heavily in developing Composer 2? How existential is this transition for you from being a pure application-layer company to one that also owns its own foundation model?

Federico: The reason we started looking into training our own models is that you can think of a model as a kind of storage drive; it can store a certain number of bits in its weights. The idea is really simple: we only care about one task. We don't even necessarily care about coding or programming broadly—we care about software engineering inside Cursor, and only within Cursor. So, what if we took all the information bits that can be stored in the model weights and allocated them entirely to this single, specific task? Furthermore, you might have noticed that Composer's cost is an order of magnitude lower than Opus and other similar coding models. This is because we can just specialize all the model weights for that specific task, allowing us to deliver a smaller model or something similar.

Dmytro: Yes.

Host: So the core idea is making sure every bit of weight or information we have is focused on the specific problem we're trying to solve in front of us.

Federico: Exactly.

Host: Got it. That sounds like an almost generalizable problem. Dmytro, I'd love to get your take. Do you think all application-layer companies should look at Cursor as a bellwether for the future? Should they all be trying to do the same thing?

Dmytro: Yeah, absolutely. I mean, we generally view this as a universal pattern in application evolution. You probably start by building a prototype using some off-the-shelf models to get your project off the ground. You might do some prompt engineering and figure out how your harness works. But the most leveraged properties of your application are actually the utilization of user data or certain specific dimensions of how your application runs—like aspects of your framework, what tools you provide, how the application works. These are critical to your app. You can capture a little bit of that through prompting, but the truly correct way is to build a model that operates entirely within your specific environment.

Federico: Yes, absolutely. For instance, with some tools an agent calls, it's hard to precisely describe that tool's behavior to the model in just a few words. But through post-training, we can bake the best ways to use those tools directly into the model. Just like with Composer, we do provide Composer with a prompt, but I think, with the way we trained it, it would work even without the prompt; it would just know what to do. Because throughout the entire training process, we were essentially constantly pushing the model in the correct direction of how it should behave.

Dmytro: There's a ceiling to what prompt engineering can achieve. This is one reason why, if you want to build truly great AI products, you have to go through fine-tuning and influence the model's behavior. The second reason is the trade-off between cost and speed. Our view at Fireworks is that when you try to optimize, you face a three-dimensional trade-off between quality, speed, and cost. Initially, you can get quite far just by optimizing infrastructure—that's how we and all our clients started. But when you start intervening in model training, you can really push that trade-off much further. You can get better models at extremely low cost and running very fast. Composer is a great example of that.

Host: Can I push you on that a bit further? I wonder how this approach fits with the "Bitter Lesson." Before we walked in, we were actually chatting about Tabnine. I remember before the era of LLMs, there were a lot of these small, specialized coding models. But what surprised many people is that as models scaled up—you know, just by training on the internet, on massive amounts of English text, and other languages—the models themselves naturally got better at coding too. So, the trend I've seen so far at least is: bigger models are better at everything, including coding. Is what you're saying counter to the broader trend of the "Bitter Lesson"?

Federico: I don't think so. But one point to note is that the large models trained by the major labs also invest heavily in code data. Code is one of the core tasks labs most want to advance, so large models aren't just naturally generalizing to coding; they have a certain degree of specialization themselves. In our case, if we believe in the "Bitter Lesson," we are actually pushing hard on the data dimension. We know that a model's capacity is fundamentally limited. So if we want to saturate all that capacity, we need to scale up the data. And to inject more data, we need to free up the weights from potential interference that the model might face.

Host: Got it, that's very interesting. Okay, let's dive into the training of Composer 2.

Training on Two Axes: RL Helps the Model Understand the Cursor Environment and Write Correct Code

Host: You released it a few weeks ago, and it immediately captured everyone's attention. Strong benchmark numbers, much lower inference running costs. Can you give us a condensed version of how Composer 2 works and what you did to make it perform so well?

Federico: We started from a very powerful base model, which is Kimi K2.5. That's a model with 1 trillion total parameters and 30 billion active parameters, so it's very sparse. Actually, we looked at the entire tech stack and realized there are essentially two axes. Composer 1 primarily worked on just one of those axes, which was reinforcement learning (RL). Composer 2, however, pushed on two different axes simultaneously: one is continued pre-training, and the other is reinforcement learning. What made Composer 2 really good was this two-pronged approach. We conducted a massive amount of "mid-training" on code tokens early in the training process, at a scale almost approaching pre-training. And after this mid-training period ended, we took the checkpoint and performed very large-scale reinforcement learning on a huge number of tasks.

Host: I see. The premise here must be that, because Cursor sits at the center of so many interesting coding tokens, you have a very unique data access advantage that allows you to train at a scale nearly comparable to pre-training.

Federico: Yes.

Host: So why not just pre-train your own model from scratch?

Federico: We just tend to think about our approach top-down rather than bottom-up. In other words, how can we deliver a model that's useful to users in the shortest possible time? If we started from the very bottom—first figuring out how to do pre-training, then scaling it to mid-training, then figuring out mid-training before doing RL—it would take a very, very long time to get a model into the hands of users. By operating in reverse, we were able to provide a good model to users in an extremely short time. Of course, we hope the next Composer version will be our own model, not based on an open-source base.

Host: Understood. So, for you, what does the model roughly learn during the mid-training phase, and what does it learn during the post-training phase?

Federico: During mid-training, it primarily learns various codebases and very common, specific code patterns, along with some world knowledge, including web data. This essentially creates a broader distribution upon which the subsequent reinforcement learning can focus and refine. During reinforcement learning, the model can directly interact with Cursor's harness. So, in a way, it gets to understand the 'world' it will inhabit for the rest of its life, right? In RL, it learns how to call tools correctly, how to navigate its environment, and how to write correct code. Because in mid-training, it just learned 'how to write code,' but that doesn't necessarily mean it learned 'how to write code correctly.' Although we tried to train it with basically correct code, the model itself doesn't know how to distinguish right from wrong. But during the RL phase, one of the core things we do is fine-tune the model's characteristics, telling it: 'Hey, you must now always write correct code.'

Host: Absolutely.

Host: Very interesting. So, is this model after mid-training similar to the one you use in Tab autocomplete, or is it a different core capability?

Federico: Yes, I mean... I think I'd look at it this way: during mid-training, we're just doing next-token prediction, optimizing how accurately the model predicts each next token.

Host: In that case, why not just do post-training on your Tab autocomplete model? Why do mid-training for a different model?

Federico: Because Tab is a very small model; it needs to run extremely fast to achieve very low latency. So the core difference is in the base model: Tab is very small, while Composer is quite large.

Core Dilemmas of RL Training: The Multi-Dimensional Trade-off Among Session Rollouts, Model Updates, and Compute Utilization

Host: Got it, got it. Okay. So it sounds like most of the heavy lifting you did for Composer 2 was concentrated on this massive reinforcement learning run. Can you help us unpack that? What does it involve, and what different problems did you solve along the way?

Dmytro: When you do reinforcement learning, it's very different from pre-training or mid-training because you're not just trying to predict the next token; you're actually running the whole framework, the whole experiment. You're having the model act in an environment and seeing how it performs in a specific "rollout"—that's the term of art here. And you assign it a reward based on whether it correctly accomplished certain things. This could be through using a Large Language Model (LLM) as a judge, or through verifiable metrics, like whether this piece of code compiles successfully. This means, compared to regular training, you need a lot of other components: you still need large-scale training, still need to orchestrate thousands of GPUs for forward and backward passes, doing everything you do in mid-training and pre-training. But now you also need to orchestrate a whole bunch of environments and run model inference. Because when you're doing these 'rollouts', you are basically running a real Cursor session, right?

Host: Sorry to interrupt. A 'rollout' basically means the entire agent history session from Cursor, correct?

Dmytro: Yes. That basically means it could involve 50 turns of conversation: the model receives an initial prompt, then decides to call some tools, you have to execute those tools, then the model generates a bunch of other code... This is the complete session of how you interact with the agent inside Cursor. During a training run, you simulate this whole session, get a final reward, and use that signal to feed it back to the trainer to be incorporated into the model weights. So you have a very large, heterogeneous update loop, because all these different components work together. And now you're trying to orchestrate all of this to run with high efficiency and high throughput, because GPUs are very expensive, and you want to train your model quickly and economically.
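To make the idea of a verifiable reward concrete, here is a minimal Python sketch of the "does this code compile" check Dmytro mentions. It is an illustration only: the function names `compile_reward` and `rollout_reward` are hypothetical, the check targets Python source because that is easy to verify with the standard library, and Cursor's actual reward signal is not disclosed in the podcast.

```python
def compile_reward(source: str) -> float:
    """Return 1.0 if `source` parses as Python, else 0.0 (hypothetical reward)."""
    try:
        # compile() only parses and byte-compiles; it does not execute the code.
        compile(source, "<rollout>", "exec")
    except (SyntaxError, ValueError):
        return 0.0
    return 1.0


def rollout_reward(final_files: dict) -> float:
    """Average the per-file compile reward over the files a rollout produced."""
    if not final_files:
        return 0.0
    rewards = [compile_reward(src) for src in final_files.values()]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    files = {"ok.py": "x = 1\n", "broken.py": "def f(:\n"}
    print(rollout_reward(files))  # one of two files compiles
```

A check like this yields a single number at the end of a long session, which is exactly the sparse signal the discussion keeps returning to.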

So, this is a very interesting problem in itself, sitting at the intersection of algorithms and infrastructure, because there are many trade-offs in how the system is co-optimized and co-designed. One aspect is what people call an "asynchronous pipeline." The core idea is that you're trying to update this model in steps: you have a current model version and you try to generate many 'rollouts' with it. So, what is your trainer doing while you're generating these rollouts? A naive approach would say: 'Okay, I'm going to now stop my trainer. I'm going to go run a bunch of sessions,' and for long-horizon tasks, these sessions might run for 5 to 10 minutes or even longer. 'Then, I take these results, pause my inference, and go back to training to try doing the update.' This is theoretically very robust algorithmically because you're not introducing any bias, but it's very inefficient system-wise, because you have half of your compute sitting idle at any given time. So instead, you can use all these clever algorithmic tricks.

Host: Yes.

Dmytro: You can turn all this into a pipeline. Think of it like a huge factory: you have a trainer workshop and a rollout workshop, and they're just spinning constantly. The rollout workshop always takes the latest model version and tries to run new sessions, simulating new agent sessions. Meanwhile, the trainer workshop always grabs new results as soon as they come out and tries to compute the update. So everything is constantly moving forward. The reason this is algorithmically different is that by the time you finish some 'rollouts' in a simulated environment, your model's weights might have already been updated on other data. So you get this staleness: a lag between the model version that generated a session and the version the trainer is currently updating. This introduces interesting training dynamics, and there are clever ways to address this. But the flip side is that all your GPUs, all your computational resources, stay fully loaded and running efficiently at all times, meaning you sustain more floating-point operations per second (FLOPS). Going back to your earlier example about the 'Bitter Lesson,' this yields higher computational efficiency.

Host: You can get a better model in a shorter time.

Dmytro: Yes. You might lose a few percentage points of effectiveness because of the asynchronous operation and not doing perfect mathematical updates, but since you're not leaving half your compute idle, it more than compensates. There's deep science and interesting interactions here.
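The staleness Dmytro describes can be pictured with a small Python sketch. This is one simple and common way to bound staleness (tag every rollout with the weight version that produced it, and discard anything too old), shown under that assumption; the podcast does not say which method Cursor and Fireworks actually use, and the names `Rollout`, `StalenessFilter`, and `max_staleness` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    policy_version: int  # version of the weights that generated this session
    reward: float


class StalenessFilter:
    """Hypothetical buffer that drops rollouts generated by too-old weights."""

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self.queue = deque()

    def push(self, rollout: Rollout) -> None:
        self.queue.append(rollout)

    def pop_batch(self, trainer_version: int, batch_size: int) -> list:
        """Take up to `batch_size` rollouts that are fresh enough to train on."""
        batch = []
        while self.queue and len(batch) < batch_size:
            rollout = self.queue.popleft()
            if trainer_version - rollout.policy_version <= self.max_staleness:
                batch.append(rollout)
            # Otherwise the rollout is silently discarded as too stale.
        return batch


if __name__ == "__main__":
    buf = StalenessFilter(max_staleness=2)
    for version in (1, 4, 5):
        buf.push(Rollout(policy_version=version, reward=1.0))
    fresh = buf.pop_batch(trainer_version=5, batch_size=10)
    print([r.policy_version for r in fresh])  # the version-1 rollout is dropped
```

The trade-off is visible in the single `max_staleness` knob: a larger value keeps more GPUs busy with usable data, while a smaller value keeps the updates closer to the mathematically clean synchronous case.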

AI "Cuts Corners" Too! Revealing Model Cheating in Virtual Environments

Federico: We take performance very seriously at Cursor, because unlike the big labs, we have tens of thousands of GPUs, not millions. So, yes, we pull out every trick in the book to squeeze every bit of performance from the GPU. For example, we even use FP4 for training in production; we partner with Fireworks to push inference performance. What's special about the infrastructure is that it's inherently more complex than pre-training. Because first, you need all the pre-training infrastructure, which is just one of the basic requirements; then, you need all the infrastructure to run these environments. And these environments must simulate the reality of a user's computer as closely as possible. Getting as close as possible is crucial, because sometimes the model can actually detect whether it's running in a simulated environment or a real one, which leads it to perform differently during RL than it does in production.

Host: Have you seen it realize it's in a virtual environment and start behaving differently?

Federico: Yes, we definitely have.

Host: That's really interesting.

Federico: It's like it thinks: 'Oh, I'm in a virtual environment. I've learned some tricks that get me higher rewards in this environment, let me try them out.' Models simply love to cheat, and RL is exceptionally good at encouraging this cheating behavior.

Federico: Yes. And we also need highly efficient inference, which is critical. There's actually a myth that during the RL process, you consume far more FLOPs on inference than on training. This is simply because open-source inference engines are very poorly optimized, not an inherent property of RL. In theory, the ratio should be roughly equivalent. If you push GPU performance to the limit, you should allocate about one-third of your training GPUs to inference, right? Because training is effectively equivalent to three forward passes: a forward pass, a data gradient computation, and a weight gradient computation. And if you truly hit the critical batch size during inference, you only need FLOPs equivalent to a single forward pass.
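Federico's back-of-the-envelope ratio can be written out in a few lines of Python. The sketch assumes, as he does, that inference runs at full utilization and, additionally, that every sampled token is trained on exactly once; the variable names are illustrative.

```python
from fractions import Fraction

# Cost of one forward pass over a token, used as the unit of work.
FORWARD = Fraction(1)

# Training a token costs roughly three forward-pass equivalents:
# the forward pass, the data (activation) gradient, and the weight gradient.
training_per_token = 3 * FORWARD

# Sampling a token at an efficient batch size costs one forward pass.
inference_per_token = 1 * FORWARD

# Inference compute expressed relative to training compute.
inference_vs_training = inference_per_token / training_per_token

if __name__ == "__main__":
    print(inference_vs_training)  # 1/3
```

Under these assumptions, inference needs about one-third as much compute as training, which is the opposite of the common belief that rollouts dominate the budget.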

Host: So that's why you chose to use Fireworks instead of an open-source inference engine.

Federico: Yes, I mean, the alternative was to build one internally, but like everyone else, our engineering team is limited. We'd rather have our engineers improve training efficiency and precision than start a separate project to develop an inference engine.

RL Inference Runs on Globally Distributed Clusters and Borrows Idle Production Compute

Host: Got it, this is super hardcore. Also, I remember you mentioned in your technical report that you do this in a globally distributed way. Why go globally distributed? And what are the difficulties?

Federico: Yes, for many reasons. First, it's very hard to find very large single contiguous clusters on the market. So what we did was use one main cluster to run all the training—after all, we can't build a globally distributed training cluster. But, we could distribute the inference component of the reinforcement learning globally, deploying it to smaller clusters around the world. When training Composer 2, we used a total of four clusters spread across the globe and quite far apart. We even utilized a portion of production compute when production traffic was low. For example, at the time we were serving the previous generation model, Composer 1.5, when user activity was at its lowest, we would just pull some inference GPUs to accelerate training. This way, we could easily scale up training without needing a single, massive contiguous cluster. As for how to achieve this, maybe Dmytro can elaborate.

Dmytro: Right, just to add to what Fed (Federico) said: our training is inherently very heterogeneous. By leveraging this heterogeneity—the different infrastructure requirements of the different components—you can actually boost efficiency significantly. This pattern works time and again. Specifically, for training, you need highly interconnected clusters, super-fast networking, and you need to work in lock-step. So these clusters are very expensive, and finding truly large ones is difficult.

Basically, at the scale of training something like Composer, finding a cluster twice the size is much, much harder than finding one of the current size. That's why, if you can disaggregate these components and deploy them in different places, on the one hand, you no longer need to find such a large cluster. On the other hand, you can make different hardware trade-offs because inference doesn't require such high global interconnect bandwidth; you just need smaller groups of GPUs interconnected together. You can use heterogeneous types of GPUs, even different generations, and play all sorts of optimization tricks. Finally, inference is also easier to scale elastically with demand. Right—during off-peak hours, you can view the entire inference resource pool as a set of GPUs that can both serve production traffic for real users and run simulated environments for RL, balancing between the two. Of course, it's a very interesting systems engineering problem.

A training step on this 1TB model takes roughly 5 to 15 minutes. That basically means every 5 to 10 minutes, you produce a brand new 1TB snapshot of the model weights. So the question becomes: how do you efficiently transfer this to another cluster on the other side of the planet? And you have to move fast, because as mentioned earlier, you can't let this staleness get out of control. So this is probably the most interesting part, and the problem we solved together. Even though the whole model is 1TB, not all the weights change at every step, because RL does a lot of very precise fine-tuning, especially as training progresses. There are also very regular patterns in the subset of weights that change each time. If you compare the model before and after a single training step (say, 10 minutes apart), the deltas are relatively small. So you can write a compression algorithm that leverages this property. Then the problem becomes similar to a database system problem: I've got this delta, I just need to ship it around the world. This delta might be 20 times smaller than the whole model, which makes the approach feasible.

Of course, now you need to build an entire mechanism around the storage system for this. That is, mechanisms for full snapshots, incremental snapshots, recovery, and reconciliation. We managed to build it in a lossless way, which means the model arriving on the other side is bit-for-bit equivalent. So you never have to worry about any errors there, and it runs extremely fast. Even under the worst network conditions, you can finish the transfer in a few minutes, often under a minute. Most importantly, you only need to pause for about 30 seconds to switch the weights over in the actual inference process. We also fully saturate the cluster's egress bandwidth by sharding the uploads and downloads. So you can use all the system-level tricks to drive down staleness.
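A toy Python sketch shows the shape of such a lossless delta. It is not the algorithm Cursor and Fireworks built, which the podcast does not describe; it only assumes that a step changes a subset of weights, and it ships just the positions whose bit pattern changed. The helper names `bits`, `make_delta`, and `apply_delta` are hypothetical.

```python
import struct


def bits(x: float) -> bytes:
    """Raw 8-byte IEEE-754 pattern, so that -0.0 and 0.0 compare as different."""
    return struct.pack("<d", x)


def make_delta(old: list, new: list) -> dict:
    """Map index -> new value for every position whose bit pattern changed."""
    if len(old) != len(new):
        raise ValueError("snapshots must have the same number of weights")
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if bits(o) != bits(n)}


def apply_delta(old: list, delta: dict) -> list:
    """Rebuild the new snapshot from the old one plus the shipped delta."""
    out = list(old)
    for index, value in delta.items():
        out[index] = value
    return out


if __name__ == "__main__":
    before = [0.0, 1.5, -2.25, 3.0]
    after = [0.0, 1.5, -2.5, 3.0]
    delta = make_delta(before, after)
    rebuilt = apply_delta(before, delta)
    # The rebuilt snapshot matches the trainer's snapshot bit for bit.
    print(delta, [bits(x) for x in rebuilt] == [bits(x) for x in after])
```

Comparing bit patterns rather than numeric values is what makes the round trip bit-for-bit exact; a real system would add the full-snapshot, recovery, and reconciliation machinery mentioned above.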

It's indeed quite complex, but you can abstract it away, make it work flawlessly, without interfering with your training algorithm. In return, you get this disaggregation capability to leverage other clusters for this work. This essentially overturns the conventional wisdom on how to build RL infrastructure. Because the conventional wisdom says you must have a super-large cluster interconnected by RDMA, which is incredibly expensive, and you might need to allocate one-third of the compute for training and two-thirds for inference. Sure, if you have very expensive networking, fast copying of that 1TB of data is easier, but that means you need a cluster that's three times as large. Now, if your inference engine is more optimized, you'll save on one-third of the cluster's GPU count anyway because of higher efficiency. And you can place half of that cluster somewhere else, using cheaper hardware in other regions, drastically reducing costs.

Floating-Point Non-Determinism: A Fatal Hidden Danger for RL Training of MoE Models

Host: I love the smiles on your faces as you're describing all this, because it's just so difficult. It's a systems engineer's dream, right? This system you guys have built is just mind-blowing.

Dmytro: We pulled many all-nighters for this.

Host: Yes, you can tell you've been in the trenches together for a long time. Getting back on track, you mentioned at the beginning that Kimi is a very large, sparse model. Does that make running RL difficult? How so?

Federico: When you do inference, you're essentially doing an autoregressive forward pass. In that forward pass, it produces the log probabilities of the tokens it samples. When we send those samples generated by the model back to the trainer, we have to re-run that forward pass. Because, as mentioned, we do asynchronous training, the model version that generated those samples might actually be several steps behind the trainer's current progress, so we have to rerun the forward pass to recompute the log probabilities. The difficulty now is, theoretically, if the model version is the same, these log probabilities should be exactly identical. But even for the exact same model version, and for the exact same tokens, you'll get slight, and sometimes quite significant, differences in the log probability values. This is commonly known as a 'numerical mismatch' in inference, a term you hear often now with Mixture-of-Experts (MoE) models.
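One way to see the mismatch Federico describes is to compare the two sets of log probabilities directly. The Python sketch below is a hypothetical diagnostic, not something taken from the podcast: it reports the largest per-token gap and the implied sequence-level probability ratio between trainer and sampler for the same tokens.

```python
import math


def mismatch_stats(infer_logprobs, train_logprobs):
    """Compare sampler and trainer log-probs for the same sampled tokens."""
    if len(infer_logprobs) != len(train_logprobs):
        raise ValueError("both sides must score the same tokens")
    gaps = [t - i for i, t in zip(infer_logprobs, train_logprobs)]
    max_abs_gap = max(abs(g) for g in gaps)
    # exp(sum of gaps) is the ratio p_trainer(sequence) / p_sampler(sequence).
    sequence_ratio = math.exp(sum(gaps))
    return max_abs_gap, sequence_ratio


if __name__ == "__main__":
    sampler = [-1.0, -2.0, -0.5]
    trainer = [-1.0, -2.1, -0.5]  # same weights in theory, different numerics
    print(mismatch_stats(sampler, trainer))
```

With identical weights and identical numerics, the gap would be zero and the ratio exactly one; any deviation is noise that the RL update has to absorb.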

Host: Why is that? Why does this happen?

Dmytro: It's mainly because, fundamentally, floating-point arithmetic that models perform is nondeterministic.

Host: Sorry to interrupt, floating-point arithmetic is nondeterministic?

Dmytro: Right. We learned in school that if you compute a + b + c, or if you compute c + b + a, the result should be the same. If you do this calculation with integers on a computer, that holds true forever.

Dmytro: But if you calculate it with floating-point numbers, which are effectively approximations made of mantissas and exponents, a + b + c and c + b + a can give you different results. So fundamentally, all operations a model does are basically multiplications and additions, and the order in which these additions accumulate affects the final result. These are tiny differences, but across thousands, or even billions, of operations, the differences get amplified. During regular inference for a model, this usually isn't that important, because a pre-trained model is inherently quite robust. Even if a few bits are flipped, it can still produce great outputs, and benchmark scores won't change.
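The effect is easy to reproduce in plain Python, whose `float` is an IEEE-754 double. The values below are chosen for illustration and are not from the podcast.

```python
# Same three numbers, added in two different orders.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c  # a + b + c
right_to_left = (c + b) + a  # c + b + a
print(left_to_right == right_to_left)  # False

# A more dramatic case: the small term is lost when it meets a huge one first.
big, small = 1e20, 1.0
kept = (big + -big) + small  # the huge terms cancel first, so `small` survives
lost = (small + -big) + big  # `small` is absorbed by `-big`, then cancelled away
print(kept, lost)  # 1.0 0.0

# With integers the order never matters.
x, y, z = 1, 2, 3
print((x + y) + z == (z + y) + x)  # True
```

Each individual addition is perfectly deterministic; what varies is the order in which a parallel machine happens to combine the terms.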

But in RL, because you are teaching the model with a very weak signal, the noise introduced by this numerical discrepancy can determine the success or failure of your training. This is particularly important. Again, this is an interesting intersection of algorithms and systems, because you can write down beautiful math, but in practice, it just won't work. There are methods to reduce this discrepancy to nearly zero, like adopting a 'batch-invariant' approach—very carefully writing all GPU kernels to always add numbers in the same order, always performing a + b + c and not some other order. This is perfectly possible, but always comes with a trade-off, such as your system possibly becoming 2 to 3 times slower. Then it becomes an interesting trade-off: how much performance loss can we accept? For instance, accepting a 10% slowdown (in practice, it can be just a few percent) to resolve 90% of the numerical discrepancy. This is the optimal balance point we found together through continuous iteration.
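The following Python sketch illustrates why the reduction order has to be pinned down. It assumes a simplified picture in which a kernel sums a row in fixed-size chunks and then combines the partial sums; real GPU kernels are far more involved, and `chunked_sum` is a hypothetical stand-in rather than anything from the podcast.

```python
def chunked_sum(values, chunk):
    """Sum `values` in chunks of size `chunk`, then add the partial sums in order."""
    partials = []
    for start in range(0, len(values), chunk):
        acc = 0.0
        for v in values[start:start + chunk]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


if __name__ == "__main__":
    row = [1e20, 1.0, -1e20, 1.0]
    # If the chunk size varies (for example with batch size), so does the answer.
    print(chunked_sum(row, 2), chunked_sum(row, 4))  # 0.0 1.0
    # Fixing the chunk size fixes the order of additions, so the answer repeats.
    print(chunked_sum(row, 2) == chunked_sum(row, 2))  # True
```

A batch-invariant kernel, in this simplified picture, is one that commits to a single chunking regardless of how much other work is in flight, at the cost of giving up whatever faster chunking the hardware would otherwise prefer.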

As you mentioned, sparsity makes all this particularly difficult. The reason is in how a Mixture-of-Experts (MoE) model works: you take the activations from each layer and feed them into a gating layer. The gating layer basically decides: for this current token, which 8 experts out of 384 should I run? It does some math, and the top 8 scoring experts get activated; the other experts stay idle for that token. This operation dramatically amplifies tiny numerical differences. Because your hidden states might have an extremely slight difference at the fifth decimal place, which would otherwise be harmless, but this difference precisely causes the cutoff to select expert number 7 instead of expert number 9. As a result, you suddenly activate a completely different part of the model, severely amplifying the previous discrepancy.

Therefore, MoE models are, by definition, far more sensitive to this numerical mismatch. During regular inference, this usually doesn't matter—it averages out. But if you're now trying to get the model to learn based on this, that discrepancy is fatally large. Because during inference, you activated expert number 7, but during the training update, you're trying to update expert number 9, which didn't even contribute during the inference step.
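A scaled-down Python sketch makes the routing flip visible. It uses 4 experts with the top 2 selected instead of 384 with the top 8, and the scores are invented for illustration; `top_k_experts` is a hypothetical helper.

```python
def top_k_experts(scores, k):
    """Indices of the k highest router scores, ties broken by lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


if __name__ == "__main__":
    # Router scores for one token as seen by the inference engine ...
    scores_inference = [0.90, 0.10, 0.50001, 0.50000]
    # ... and by the trainer, differing only at the fifth decimal place.
    scores_training = [0.90, 0.10, 0.50000, 0.50001]

    print(top_k_experts(scores_inference, 2))  # [0, 2]
    print(top_k_experts(scores_training, 2))   # [0, 3]
```

The hard cutoff turns a difference of 0.00001 in a score into a completely different expert, which is why the gradient can end up flowing into weights that never produced the sampled token.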

Host: So did you guys have to resort to hand-writing GPU kernels to solve this issue?

Dmytro: Yes. You can solve many throughput problems around this; there are always trade-offs. Specifically for MoE, you can use an interesting trick called 'router replay.' Basically, you can have your inference engine pass some extra information to the trainer, like telling it: 'Hey, I activated expert number 7 for this token.' This tiny piece of information is just an integer that identifies which expert was activated, so the trainer can align with it. A lot of numerical alignment work is essentially using tricks like this—matching quantization levels, matching kernels, etc.—to minimize the divergence between the training and inference implementations, and this can make a huge difference. Otherwise, either your entire training run might diverge, or you suffer a massive drop in computational efficiency because you need far more data to compensate for the mismatch.
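The router-replay idea can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the Fireworks or Cursor implementation: the functions `route`, `inference_step`, and `trainer_experts` are hypothetical names, and real systems would pass the indices between separate inference and training processes.

```python
def route(scores, k):
    """Pick the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def inference_step(token_scores, k):
    """Inference side: route every token and record the chosen expert indices."""
    return [route(scores, k) for scores in token_scores]


def trainer_experts(token_scores, k, replayed=None):
    """Trainer side: reuse the recorded indices if given, otherwise re-route."""
    if replayed is not None:
        return replayed
    return [route(scores, k) for scores in token_scores]


# One token, seen with slightly different gate scores on each side.
inference_scores = [[0.90, 0.50001, 0.50000, 0.10]]
trainer_scores = [[0.90, 0.50000, 0.50001, 0.10]]

recorded = inference_step(inference_scores, k=2)
without_replay = trainer_experts(trainer_scores, k=2)
with_replay = trainer_experts(trainer_scores, k=2, replayed=recorded)
```

Without replay the trainer selects a different expert than inference did. With replay it updates exactly the experts that generated the token, at the cost of shipping a few integers per token.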

Offline RL Lays the Foundation to Prevent Bad User Experiences

Host: I'd love to chat more about the specifics of your RL setup. Can you share what kind of reward signal you're using? Or is that a no-go? Understood, it's confidential, top secret. Okay, that's fair.

So, since learning in a simulated environment is effectively doing simulated 'rollouts,' and you have such a massive amount of real user data to learn from, why not just do RL directly on real user data and within the framework of a real user environment, instead of in simulation?

Federico: Actually, we do that as well. That's what we call 'real-time RL.' We use the same technology, via Fireworks, to synchronize the inference weights. We capture user signals—like whether a user was satisfied or disappointed with a piece of code the model generated—and we can then update that model in real-time, enabling us to continuously deliver a new model version every few hours. We are working on shortening that cycle. Though, at some point in the future, we will actually have to lengthen that cycle again, because as the context horizon of models gets longer and longer, we will be forced to stretch the time back out. It's an interesting tug-of-war: right now, for stability, we're working to shorten the cycle to figure out the right hyperparameters; and once we nail those, we'll have to stretch the time back out to further extend the long-horizon processing capabilities of these models.

Host: Given all this actual user data, I'd imagine it would be far more valuable for training and fine-tuning. Do you still need to do any simulated RL akin to a pre-training phase? Why not just jump straight into online RL? Why is offline RL still necessary?

Federico: Currently, online RL is incredibly inefficient. One problem we face is that GPUs basically sit 'offline' and idle for long stretches. Beyond that, there are different trade-offs between efficiency and user experience. With simulation, you can actually perform multiple 'rollouts' from the same prompt. That is, you take one task, and let the model attempt it 16 times or 128 times, branching out into different rollout paths from the same prompt. Some of these will go well, others won't. By running multiple rollouts in parallel, you get an extremely precise signal. Algorithms like GRPO (Group Relative Policy Optimization) function by conducting multiple rollouts simultaneously and comparing them against each other. Whereas, if you're running online, you only get one rollout signal back at a time, so there's a big algorithmic trade-off. Most importantly, if a simulated rollout goes wrong, it's no big deal; you've just wasted some GPU time at worst. But if you're facing real users, the minimum bar is much higher, because you're effectively running an A/B test. If the model outputs something strange, that's a terrible user experience.
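Why several rollouts per prompt give a sharper signal can be illustrated with a group-relative advantage in the spirit of GRPO. This is a simplified sketch: the function name `group_relative_advantages`, the `eps` constant, and the 0/1 rewards are assumptions for illustration, and Cursor's actual reward and algorithm details are not disclosed in the interview.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts of the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four simulated rollouts of one task: two succeeded, two failed.
group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group)

# A single online rollout has nothing to be compared against.
lone_advantage = group_relative_advantages([1.0])
```

In the group, successful rollouts receive a positive advantage and failed ones a negative advantage. With only one rollout the advantage collapses to zero, which is one way to see the algorithmic trade-off of the online setting.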

Host: Got it, so when it's not a real user, you can go off-policy more frequently because you can try all sorts of crazy ideas without worrying about impacting the user experience. You can do more rollouts, use GRPO, and then essentially bootstrap the model's performance up to a level good enough to put in front of a user.

Federico: Exactly. We use the offline (simulation) phase to teach the model to reason. Actually, when we say offline, it often refers to techniques like DPO, while REINFORCE is more online. During the offline phase, we teach the model how to reason and imbue it with the behavioral patterns it should have. We try to inject new information about this world into it and teach it to call tools. Only after that, do we push it live in front of users.

Because you can imagine, if the model is bad, users won't want to use it, and thus won't give us any feedback, right? So, the model has to reach a certain threshold before it can be introduced into online RL. We ourselves must be extremely happy with the model before releasing it. That's the paradox of online RL (or what we like to call real-time RL): you can't use it to create a model from scratch, because you need users to use it first. So, it must already be good enough on its own; we can only make it better via online methods.

(Note: DPO, short for Direct Preference Optimization, is an efficient fine-tuning technique for aligning large language models, primarily aiming to make the model's output better match human preferences. REINFORCE is a classic policy gradient algorithm in reinforcement learning, proposed by Ronald J. Williams in 1992.)

Federico: Yes. It's kind of like the cherry on top, delivering a super delightful experience for the whole session. Hopefully, one day, it can become a giant cherry.

Dmytro: Yes.

Host: Yes. This reminds me of what Dan Roberts presented at our conference last year; I think you were there. Traditionally, it's a big cake and a small cherry.

Federico / Dmytro: Now it's a small cake and a big cherry. Right.

Host: I'm curious about Andrej Karpathy's famous quote, where he said current reinforcement learning (RL) is still incredibly inefficient. You conduct a very long rollout, but only get a tiny bit of information at the very end; it feels like sipping bits through a straw. What's your take? Have you figured out any way to squeeze more bits out of this path?

Federico: Uh, I can't disclose that.

Host: Okay, okay, got it. We're back in the classified section. Good, that means I'm asking the right questions.

200K Context Window Taming Millions of Tokens! Cursor's Secret: "Self-Summarization"

Host: You mentioned each rollout takes a few minutes. The whole field seems to be moving toward building "long-horizon agents": agents that can work for extended periods without interruption and generally don't fail. I really liked that METR scaling chart. What needs to happen during reinforcement learning to make agents run for longer durations?

Federico: A few things. First, one difficulty with RL is that the longer the trajectory, the harder the "credit assignment" becomes. You can imagine, we only give the model a thumbs up or down at the very end, after it finishes all its work. Simply put, the model must ask itself: "What did I do right, and what did I do wrong?" That's the credit assignment problem. The longer the trajectory, the harder it gets. So you have to use many tricks there. Another problem is running out of space; these models have limited context windows, and they'll eventually hit the ceiling.

At Cursor, we solve this by incorporating "compression" into the loop. We call it "self-summarization." During reinforcement learning, the agent actually learns how to keep going, perpetually. In practice, our model has a 200K context window, but it can actually process millions of tokens. This is precisely because it has the ability to summarize its own work, then use that summary to restart its context window while still striving to complete the assigned task. By doing this, as we push the model to act correctly towards the goal, we are simultaneously co-training it to generate high-quality summaries and to perfectly understand that summary. So it's almost like an extension of reasoning capability.
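The control flow of such a loop can be sketched as follows. Everything here is a hypothetical stand-in: `run_with_self_summarization` and `toy_summarize` are invented names, tokens are approximated by whitespace-separated words, and the summarizer is a trivial function rather than the model itself.

```python
def run_with_self_summarization(steps, context_limit, summarize):
    """Process steps in order, compacting the context whenever the budget would overflow."""
    context = []
    tokens_in_context = 0
    total_tokens_processed = 0
    resets = 0
    for step in steps:
        step_tokens = len(step.split())
        if tokens_in_context + step_tokens > context_limit:
            # Replace the whole context with a summary and continue the task.
            summary = summarize(context)
            context = [summary]
            tokens_in_context = len(summary.split())
            resets += 1
        context.append(step)
        tokens_in_context += step_tokens
        total_tokens_processed += step_tokens
    return context, total_tokens_processed, resets


def toy_summarize(context):
    """Stand-in for a model-written summary: keep the first word of each entry."""
    return " ".join(entry.split()[0] for entry in context)


steps = ["read file a b c", "edit file d e f", "run tests g h i", "fix bug j k l"]
final_context, processed, resets = run_with_self_summarization(
    steps, context_limit=12, summarize=toy_summarize
)
```

With a budget of 12 tokens the loop still works through 20 tokens of steps by resetting once. In the setup described above, the quality of the summary is itself trained, because a poor summary makes the rest of the task harder to finish.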

Dmytro: I find this fascinating. Because typically, context management is considered part of the harness, right? But in this case, you're actually co-optimizing a part of the harness with the model's own work, throwing all of it into the optimization loop. We see time and again in AI: the more compute you pour into a problem, the more you can solve it end-to-end. The magic of compute from the "bitter lesson" strikes again, giving you a system where components work better together.

Host: Absolutely. Do you think every company will run RL on their own harness? Do you think the problem formulation is the same for every company as it is for Cursor?

Federico: If they are using AI, generating a huge number of tokens, and have a product that can be optimized, I believe training their own model is the right move and the right direction.

Host: Got it, that's interesting.

RL Tells the Model Its Role Is to Be an Expert and Get Things Right

Host: It seems like most of your reinforcement learning focuses on the harness and tool-use aspects, rather than making the model good at "predicting the next token of code." For other founders thinking, "Where should I use RL?" is this roughly the pattern they should follow? Meaning: If you want an agent to perform tasks using tools over a long horizon, you need RL. If you just want to create a model that's good at summarization or next-token prediction, you probably don't need RL. Is that a good framework for judging when RL is necessary?

Federico: I think RL applies everywhere. Even for Tab autocomplete, we used... of course, this is just my personal theory, without any backing. When you pre-train a model, the model is simply absorbing the entire corpus of human knowledge. Suppose you are training a math model; the model learns all the math knowledge on Stack Exchange. When this pre-RL model faces a math problem, it needs to figure out what role it's playing—is it an expert, or a student trying to learn? So, I think one of the things that happens during RL is that we're tuning this knob to tell the model: "Hey, you are an expert, you need to get things right." So one change is that we are distilling this distribution, which happens in roughly a few stages. In the first phase, the model learns very fast and gets very good quickly; then comes a second phase requiring massive compute to continuously improve the model, where you start seeing reasoning abilities emerge and this pattern develop. In that first phase of the curve, I think we are simply tuning the knob to tell the model: "Hey, you gotta get things right here." So even in lower-compute scenarios, it's very useful for the model to know it must get things right. That's my point.

(Note: Stack Exchange is a network of specialized Q&A communities, each focusing on a specific domain, where users share high-quality knowledge by asking, answering, and voting on questions.)

Dmytro: I strongly agree. We see this pattern across many use cases. We've helped many customers with RL fine-tuning, and often you find that continual pre-training (basically mid-training) and regular supervised fine-tuning (SFT), in an abstract sense, are more like the transfer of new knowledge; whereas RL is about refining behavior, or refining specific qualities you want the model to possess. Often you need both. Take even the summarization example you mentioned earlier—RL is actually very useful there too. Because sometimes, if you want a very specific style of summarization, it's hard to come up with perfect examples of a "good summary" and a "bad summary" to precisely describe it. But if you use an LLM as a judge, you can actually set very precise evaluation criteria. You can use prompts to say: "Okay, this is how I evaluate if a summary is good or bad," and throw it into the RL loop, letting the model try different summarization styles and figure out what you really want, while another LLM evaluates if it meets the specific criteria. You see this pattern not only in coding but quite frequently in other domains as well.

LLM as a Judge, Software Writing Phase 3: "Writing Evaluation Rules"

Host: Okay, I'm going to throw this next question to Dmytro, because Federico is sure to plead the Fifth (choose to remain silent). You've mentioned "LLM as a judge" a few times now. Will companies that have experts manually inspect RL rollouts and somehow manually guide model behavior end up more successful, or are LLM-as-a-judge and other automated evaluation criteria more likely to get us where we need to go?

Dmytro: You can't really have experts reviewing every single rollout. I mean, that would just become some form of... if it's a real user, that's live RL, or some kind of RLHF (RL from Human Feedback) or DPO. In general, the more verifiable your reward signal is, the better, because that lets you scale compute and, in some cases, get better results. "Verifiable" basically means: can you automatically generate this signal without human intervention? Of course, if it's math or coding, and you can engineer something very deterministic, that's the best. LLM as a judge works because it actually relies on the distinction between generator and discriminator: judging is much easier than creating. It's the same for humans, right? Being a critic is much easier than being a VC.

Host: Haha, no veiled implications intended.

Dmytro: Definitely no other meaning. But really, judging is much easier, and you can precisely engineer different criteria for how you want to rank a response. You'll see a pattern where you might design very complex evaluations across multiple dimensions. Because if you throw multiple dimensions at a single LLM, it might get confused about how to judge, right? So you can break it apart. Like: this LLM judges based on style, that LLM judges based on a different aspect like factuality, truly crafting these reward signals. Some of them will be deterministic, some will be LLM-based, and that is what guides your model's behavior. Then you just throw more compute at it and watch the curves on the charts climb upwards.

Host: Do you think we'll see RL play an even more significant role in those areas that are harder to verify? Do you think LLM as a judge is sufficient?

Dmytro: It's one of the first techniques you'd start trying. Ideally, you want to figure out what the real outcome is, what the actual metric is that you want to capture. So, trying to approximate that metric is one path; trying to build bigger simulation environments is another, right? If you can simulate more of your product, more of your environment, you often get that end metric you care about, it's just harder to capture. If you can figure out how to capture it, that's great. As for the role of your domain experts, experts remain absolutely essential, because it's crucial that someone designs these tasks and actually encodes the product experience into them. We've gone through Software 1.0, 2.0, 3.0, right? In the past, we wrote software directly; then we moved to writing training data; and now, you're essentially writing evaluation rules. But this is still incredibly important. You need to look at examples, look at data, see where your product is failing, and how to guide the model towards the right behavior.

An RL Environment Has Three Parts, and Cursor Built the Entire Virtual Machine Stack Itself

Host: I want to ask about RL environments, which might tie into what you just talked about. There seems to be a big explosion now, with some companies doing RL environments seeing their revenue skyrocketing. What exactly are they providing that's really useful? Because I imagine, using Cursor as an example, you already have massive amounts of data on how customers actually use your environment. So, what can an RL environment provider give you on top of what you already have?

Federico: We actually don't use any environment providers' products. I think building an environment that works well is very hard. For those who can't access this data, it's a valuable product. However, in coding specifically, everyone has access to a vast ocean of usable coding environments. That's GitHub, right? You can go there, have a model install all the dependencies of a repo, and that's an environment you can run. I think a big part of the difficulty comes down to infrastructure. You can imagine, an environment that's effective for a specific task might need to spin up various services. Like, if you're doing a modification, say a database migration, to test if it truly works, you need to boot up the database, right? Those kinds of things are very tricky. I think these environment companies are quite helpful for that type of stuff.

Dmytro: There are really two layers to this. First, if you look at frontier labs, they are trying to build a general model that's good at everything, so they need to cover all these different underlying tasks, pack them into the same model, and encourage it to generalize. So that's one part, and in that case, they are very helpful. But in a situation like Composer, where you have your real, actual product—and I think this is what we at Fireworks also believe—if you have your own actual product, you should optimize performance for it. The most powerful environment is your own product, absolutely right. Because that's where your model will actually be used.

Of course, if you are a frontier lab, you can't do this across every product out there; but if you want to build the best model for your own product, making it specialized and custom-tailored, you should just use your production environment directly. Of course, you need to properly sandbox it, right? You don’t want the model wreaking havoc in your production database; you need to do clones and things like that. The environment companies provide some tooling, like general infrastructure, to make that easier. But overall, you want your RL environment to be as close to the real production environment as possible.

For example, from what we've seen, if you look at toy RL examples or toy-level frameworks, they always start like: "Oh, here's a toy environment, I'm going to spin up a Docker container and run everything inside." That's great if you just want to teach a model how to play Atari games or something. But to truly transition to a production case, you can't just shove a real production application into a Docker container. We ourselves found this out very early on, for instance, when working with MetaFox, running trainers on the Cursor side. And for some other customers, we run the trainer on our platform, but for the environment part, we actually default to running them on the customer's side, because that's where the actual implementation lives. You actually have the same trainer setup (even if it's part of the Fireworks platform) calling the real production environment on the customer side, rather than trying to wrap it up and componentize it.

Federico: Doing this on a managed platform is extremely difficult and introduces drift.

What we call an "RL environment" is actually made up of three components. The first is the harness, which is where the model can submit tool calls and where those tools are executed. The second is what we could call the "operating system," meaning the actual world and state the model is interacting with. And the third is the reward component we need, which checks at the end if the work was completed correctly. Typically, the harness is fairly portable; you can take it to many different environments. The crucial part is the operating system, and to replicate that, ordinary containers don't actually work very well. So at Cursor, we actually built the entire virtual machine stack ourselves, so we can spin up VMs very rapidly. It has to be incredibly bursty, because you can imagine, we might ask this system: "Please give me 100,000 virtual machines right now," and they must all be up and running.
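The three-part split can be written down as a small data structure. This is a conceptual sketch only: the `Environment` class, the `write_file` tool, and the `migration_applied` check are hypothetical, and an in-memory dictionary stands in for the virtual machine that plays the "operating system" role in the real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Environment:
    # 1. Harness: the tools the model may call, and where they are executed.
    tools: Dict[str, Callable]
    # 3. Reward: checks at the end whether the work was completed correctly.
    reward_fn: Callable[[dict], float]
    # 2. "Operating system": the world state the tools act on (a dict in this toy).
    state: dict = field(default_factory=dict)

    def call_tool(self, name, arg):
        """Execute one tool call against the current state."""
        return self.tools[name](self.state, arg)

    def score(self):
        """Evaluate the final state with the reward component."""
        return self.reward_fn(self.state)


def write_file(state, arg):
    """Toy tool: store file content in the state."""
    path, content = arg
    state.setdefault("files", {})[path] = content
    return f"wrote {path}"


def migration_applied(state):
    """Toy reward: 1.0 if the migration file exists at the end, else 0.0."""
    return 1.0 if "migrate.sql" in state.get("files", {}) else 0.0


env = Environment(tools={"write_file": write_file}, reward_fn=migration_applied)
before = env.score()
observation = env.call_tool("write_file", ("migrate.sql", "ALTER TABLE users ADD age INT;"))
after = env.score()
```

The harness and reward pieces carry over unchanged if `state` is swapped for something heavier, which mirrors the remark that the harness is portable while the operating-system layer is the hard part.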

Host: That's amazing. I've thoroughly enjoyed this conversation today. I think Cursor has really set an example of how a company can evolve from an application-layer company into a true frontier model lab. I think what you've done with Composer 2 is really leading this trend. It's been truly special to hear these inside perspectives. Dmytro, hearing about the two of you fighting shoulder-to-shoulder through countless late nights, tackling those hardcore infrastructure problems from the trenches side-by-side, is just so cool—those efforts made everything possible. So, thank you. Thank you both for joining our podcast today.

Federico: Thank you so much for having us.

Dmytro: Thanks.

Reference Link:

https://www.youtube.com/watch?v=UDTr9yUnLUI
