Recently, the performance of large language models (LLMs) in complex tasks such as mathematical reasoning has become increasingly impressive. Particularly, the training paradigm known as "Reinforcement Learning with Verifiable Rewards" (RLVR) has become the mainstream method for further enhancing model reasoning capabilities. Simply put, it involves having the model solve math problems, rewarding it for correct answers and withholding rewards for incorrect ones, thereby making the model smarter through this process.
However, there is a problem—reinforcement learning training is extremely "expensive" and requires massive computing power. To reduce costs, researchers typically use Parameter-Efficient Fine-Tuning (PEFT) methods, with LoRA (Low-Rank Adaptation) being the most popular. LoRA's core idea is simple: instead of updating all parameters, only train a small number of low-rank matrices to achieve decent results.
But here's the key question: is LoRA the default because it is actually the best fit for reinforcement learning, or simply out of habit?
The author team of this paper (from institutions such as Zhejiang University, Hong Kong University of Science and Technology, and Brown University) focused on this issue. They observed that although new LoRA variants keep appearing, in reinforcement learning scenarios almost everyone still uses the original standard LoRA. That is odd—in supervised learning, variants like DoRA and AdaLoRA have been shown to beat standard LoRA, so why has no one systematically studied this in reinforcement learning?
Therefore, the authors raised the core research question: Which parameter-efficient method is most suitable for reinforcement learning?
To answer this question, they built a large-scale evaluation benchmark and tested over 12 PEFT methods on the DeepSeek-R1-Distill model family (1.5B and 7B parameter scales), covering mathematical reasoning tasks such as MATH-500 and AIME24/25.
The paper's three core findings are very counter-intuitive:
Structural variants completely outperform standard LoRA: DoRA, MiSS, AdaLoRA, and other structural variants consistently outperform standard LoRA, with DoRA even surpassing full-parameter fine-tuning!
SVD initialization leads to "catastrophic collapse": Initialization strategies based on Singular Value Decomposition (SVD), such as PiSSA and MiLoRA, completely fail in reinforcement learning. The authors revealed the mechanism behind this through spectral analysis: these methods force updates in the principal components, but reinforcement learning needs to learn in the "non-principal component" space, making the two fundamentally incompatible.
More parameter compression is not always better: Extreme compression methods like VeRA and Rank-1 adapters severely limit the model's "plasticity," leading to performance collapse. Reinforcement learning requires a lower bound of expressive capacity; compress too far and the model cannot learn anything.
Related Work: The "Past and Present" of RLVR and PEFT
RLVR: "Training" Models with Verifiers
Traditional RLHF (Reinforcement Learning from Human Feedback) requires manual annotation, which is costly. RLVR, on the other hand, changes the approach: for tasks with definitive answers like math problems and code, it directly uses a rule verifier (e.g., checking if a math answer is correct) to give rewards. The core algorithm of this method is GRPO (Group Relative Policy Optimization).
The working principle of GRPO is: give the model a problem $q$, have it sample a group of $G$ answers (e.g., 8), then compute each answer's "advantage" from the rewards of the whole group to decide which answers should be reinforced and which suppressed. The objective function looks roughly like this:

$$J_{\mathrm{GRPO}}(\theta)=\mathbb{E}\Big[\tfrac{1}{G}\textstyle\sum_{i=1}^{G}\tfrac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\big(r_{i,t}(\theta)\,\hat A_{i},\ \mathrm{clip}(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat A_{i}\big)\Big]-\beta\,D_{\mathrm{KL}}\big(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big)$$

where $r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the token-level importance ratio, and the group-relative advantage is $\hat A_i=\big(R_i-\mathrm{mean}(\{R_j\}_{j=1}^{G})\big)/\mathrm{std}(\{R_j\}_{j=1}^{G})$.
Later, improved versions like DAPO and Dr. GRPO emerged, mainly addressing issues such as training instability and low sample efficiency. For example, DAPO introduces an "asymmetric clipping" strategy, making it easier to boost low-probability exploration tokens; Dr. GRPO removes length normalization to avoid the model's preference for "long but wrong" answers.
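The group-relative advantage at the heart of GRPO can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical group of 8 binary verifier rewards; the function and variable names are our own, not the paper's:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: normalize each reward by the
    mean and (population) std of its own group of G sampled answers."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 8 sampled answers to one problem, binary verifier rewards (1 = correct).
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
advs = group_relative_advantages(rewards)
# Correct answers get a positive advantage (reinforced);
# incorrect answers get a negative advantage (suppressed).
```

Note that no learned value network is needed: the group itself serves as the baseline, which is what makes GRPO cheap relative to classic PPO.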
The PEFT Family: Showing Off Their Skills
PEFT methods can be divided into several categories:
Baseline methods: Full-parameter fine-tuning (performance ceiling) and standard LoRA (efficiency baseline). LoRA's core formula is:

$$W = W_0 + \Delta W = W_0 + BA$$

where $W_0 \in \mathbb{R}^{d\times k}$ is the frozen pre-trained weight, $B \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times k}$ are trainable low-rank matrices, and $r$ is the rank (usually much smaller than $d$ and $k$).
Structural variants: These methods change LoRA's architecture design. For example, DoRA decomposes the weight update into "direction" and "magnitude" parts; AdaLoRA dynamically adjusts the rank using an SVD-like structure; MiSS allocates parameters through sub-network selection.
Initialization strategies: Retain the LoRA architecture but change the initialization. PiSSA and MiLoRA use SVD to decompose the pre-trained weight $W_0$, then initialize the adapter from its principal or minor components, respectively; LoRA+ sets different learning rates for the $A$ and $B$ matrices.
Extreme compression: To save memory, LoRA-FA freezes $A$ and only trains $B$; VeRA goes further, freezing even the low-rank matrices and training only scaling vectors.
Other PEFT: For example, LayerNorm Tuning only tunes normalization layer parameters; IA³ scales activation values through element-wise multiplication.
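As a concrete illustration of the LoRA baseline above, here is a minimal numpy sketch of the adapted forward pass. The toy dimensions, the zero initialization of B, and the alpha scaling are common conventions assumed for the example, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 8                      # output dim, input dim, LoRA rank (r << d, k)

W0 = rng.normal(size=(d, k))             # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init -> Delta W = 0 at start
alpha = 16                               # scaling hyperparameter

def lora_forward(x):
    # y = W0 x + (alpha / r) * B A x ; only A and B receive gradients
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# With B = 0, the adapted model matches the frozen base model exactly.
assert np.allclose(lora_forward(x), W0 @ x)

# Trainable-parameter ratio of LoRA vs. the full matrix for this layer:
ratio = (r * (d + k)) / (d * k)          # 0.25 here; far smaller when d, k are large
```

The zero-init of B is what guarantees training starts exactly from the pre-trained model, which matters for RLVR since the base policy already reasons reasonably well.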
Core Method: How Do You "Fairly Compare" These 12 PEFT Methods?
Experimental Design
To ensure the fairness and reliability of the evaluation, the authors put in a lot of effort:
Model Selection: Used the 1.5B and 7B models from the DeepSeek-R1-Distill series. These models have already undergone supervised fine-tuning (SFT), possessing basic reasoning capabilities and a standard output format (placing the reasoning process inside the <think> tag and the final answer inside \boxed{}).
Dataset: Used the DAPO-Math-17k dataset (approximately 17,400 high-quality math problems) for training, and evaluated on six mathematical reasoning benchmarks, including AIME24/25, MATH-500, and AMC.
Training Configuration: Unified hyperparameters—learning rate 1e-5, LoRA rank 32, dropout 0.05. Generated 8 answers per problem, trained using the DAPO algorithm. The 1.5B model was trained for 1024 steps (batch size 128), and the 7B model was trained for 8192 steps (batch size 32).
Reward Mechanism: A very strict binary reward—1 point for a completely correct answer, 0 otherwise. Used latex2sympy and math_verify to verify mathematical equivalence.
Evaluation Metrics: To cope with statistical fluctuations in small-sample benchmarks like AIME, the paper reports Avg@k (average accuracy over k generations) and Pass@k (at least one correct among k generations).
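The reward and metric definitions can be sketched as follows. This is a simplified illustration: real boxed answers may contain nested braces, and the paper checks full mathematical equivalence with latex2sympy and math_verify rather than exact string match; all names here are hypothetical:

```python
import re

def extract_boxed(text):
    r"""Pull the answer out of the last \boxed{...} in a completion.
    Simplified: assumes no nested braces inside the box."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def binary_reward(completion, gold):
    # Strict binary reward: 1 for a matching boxed answer, 0 otherwise.
    # (The paper additionally verifies mathematical equivalence.)
    pred = extract_boxed(completion)
    return 1 if pred is not None and pred == gold else 0

def avg_at_k(rewards):
    """Avg@k: mean accuracy over k sampled generations."""
    return sum(rewards) / len(rewards)

def pass_at_k(rewards):
    """Pass@k: 1 if at least one of the k generations is correct."""
    return 1 if any(rewards) else 0

# Toy example: 4 generations for one problem whose gold answer is 42.
rewards = [binary_reward(c, "42") for c in [
    r"... so the answer is \boxed{42}",
    r"... therefore \boxed{41}",
    "no boxed answer here",
    r"... hence \boxed{42}",
]]
```

Reporting both metrics matters: Avg@k reflects average policy quality, while Pass@k reflects whether the correct solution is reachable at all.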
Four Dimensions of Ablation Experiments
To verify the robustness of the findings, the authors conducted comprehensive ablation studies:
Batch Size (32 vs 128): Test whether PEFT methods prefer small batches like SFT.
Algorithm Variants (GRPO, DAPO, Dr. GRPO): Verify if the conclusions depend on specific algorithms.
Learning Rate (1e-5, 5e-6, 1e-6): Confirm the optimal learning rate range.
LoRA Rank (1, 8, 16, 32): Explore the relationship between rank and performance.
Experimental Results: Three Findings That Overturn Perceptions
Finding 1: LoRA Is "Obsolete"; Structural Variants Are the True Kings
The experimental results were shocking from the start: Standard LoRA (42.5%) lagged behind full-parameter fine-tuning (44.9%), while the structural variants shone brightly:
DoRA: Average accuracy of 46.6%, not only surpassing LoRA but even exceeding full-parameter fine-tuning! It reached 39.0% on AIME24 and 71.9% on AMC.
AdaLoRA: 44.2%, stably surpassing LoRA.
MiSS: 43.4%, also performing better than LoRA.
Why is this? The authors believe that standard LoRA's low-rank constraint is too "rigid" and cannot cope with the complex strategy adjustment needs in reinforcement learning. DoRA, through decoupling magnitude and direction; AdaLoRA, through adaptive rank allocation; and MiSS, through parameter sharding, all provide a more flexible optimization space, better fitting the optimization dynamics of RLVR.
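DoRA's magnitude-direction decoupling is easy to see in code. Below is a minimal numpy sketch of the DoRA reparameterization (a trainable magnitude vector m times the column-normalized direction W0 + BA); the dimensions are toy values for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 32, 32, 4

W0 = rng.normal(size=(d, k))
A = rng.normal(scale=0.01, size=(r, k))  # trainable low-rank factor
B = np.zeros((d, r))                     # trainable low-rank factor, zero init
m = np.linalg.norm(W0, axis=0)           # trainable magnitude, init to column norms of W0

def dora_weight():
    V = W0 + B @ A                        # "direction" part: LoRA-style low-rank update
    V_norm = np.linalg.norm(V, axis=0)    # column-wise norms
    return m * (V / V_norm)               # "magnitude" m rescales each unit direction

# At init (B = 0), the decomposition reproduces the frozen weight exactly.
assert np.allclose(dora_weight(), W0)
```

The key property: the low-rank update only steers the direction of each weight column, while m controls its length independently, giving the optimizer two decoupled knobs instead of one entangled low-rank update.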
Core Conclusion: Stop blindly using standard LoRA! In reinforcement learning scenarios, structural variants are the optimal choice.
Finding 2: The "Catastrophic Collapse" of SVD Initialization
This finding is particularly interesting. Theoretically, SVD initialization methods like PiSSA and MiLoRA should be quite reasonable:
PiSSA: Initializes with principal components (large singular values), assuming important information is concentrated in the principal components.
MiLoRA: Does the opposite, initializing with minor components (small singular values), believing this can retain more "new" information.
The result? PiSSA collapsed directly to 0.2% accuracy, and MiLoRA was only 18.0%. What happened?
The authors conducted in-depth spectral analysis (see Figure 3). The key finding is that reinforcement learning updates occur mainly in the "non-principal component" space. Recent research (Zhu et al., 2025) reveals that RLVR, to protect the knowledge structure of the pre-trained model, avoids high-curvature principal components and learns in the low-curvature non-principal component subspace.
Why PiSSA fails: It forces updates in the principal components, directly conflicting with RLVR's "non-principal component preference," leading to training collapse.
Why MiLoRA fails: Although initialized in the minor components, the singular values of these components are too small (close to zero), making the initial adapter almost zero. Without sufficient initial bias, the gradient automatically "slides" towards the principal component direction (because the gradient follows the direction of maximum variance), eventually degrading into principal component updates.
The spectral analysis plot shows that MiLoRA's final update distribution is almost identical to PiSSA's, with obvious spikes in the principal components, whereas the full-parameter fine-tuning update is uniformly distributed across the entire spectrum.
Core Conclusion: SVD-based initialization strategies are not suitable for RLVR. If you want to improve initialization, you should adjust learning rate dynamics like LoRA+ instead of playing with SVD decomposition tricks.
Finding 3: The "Expressiveness Floor" of Parameter Compression
Many people might think that fewer parameters are better—saving both memory and computing power. But the experiments reveal a harsh truth: Reinforcement learning has a lower bound requirement for the number of parameters.
Comparing the trainable parameter ratios of different methods:
Full parameters: 100% (accuracy 44.9%)
LoRA: 1.55% (accuracy 42.5%)
MiSS: 0.99% (accuracy 43.4%)—compressed to 2/3 of LoRA, yet performance is slightly better
LoRA-FA: Freezes half the parameters (accuracy 43.0%)—still acceptable
VeRA: 0.0029% (accuracy 40.7%)—collapsed
IA³: Only tunes scaling vectors (accuracy 22.3%)—collapsed even more
LN Tuning: Only tunes normalization layers (accuracy 41.8%)—barely usable but significantly weaker
Rank-1 LoRA (accuracy 40.5%)—same as the baseline model, equivalent to no training
Why is this? The authors explain that the supervision signal in reinforcement learning is sparse (only 0 or 1 reward signals), unlike the dense token-level feedback in supervised learning. This sparse signal requires sufficient parameter space to "carry" complex strategy adjustments. Extreme compression methods (like VeRA only training scaling vectors) create an "information bottleneck," severely limiting the model's ability to learn reasoning behaviors.
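To make the "expressiveness floor" concrete, here is a back-of-the-envelope count of trainable parameters for a single 4096x4096 linear layer under the different schemes. These are per-layer toy numbers (the paper's percentages are computed over the whole model, so they differ slightly), but the orders of magnitude tell the story:

```python
# Trainable-parameter counts for one d x k linear layer under different
# PEFT schemes (illustrative layer size; real models stack many such layers).
d, k, r = 4096, 4096, 32

full = d * k                 # full fine-tuning: every entry trainable
lora = r * (d + k)           # LoRA: B (d x r) plus A (r x k)
lora_fa = d * r              # LoRA-FA: A frozen, only B trained
rank1 = 1 * (d + k)          # Rank-1 adapter
vera = d + k                 # VeRA-style: only two scaling vectors

for name, n in [("full", full), ("LoRA r=32", lora), ("LoRA-FA", lora_fa),
                ("rank-1", rank1), ("VeRA-ish", vera)]:
    print(f"{name:10s} {n:>10d}  ({100 * n / full:.4f}% of full)")
```

For this layer, LoRA at rank 32 lands at about 1.56% of full, in the same ballpark as the paper's model-wide 1.55%, while the VeRA-style count is three orders of magnitude smaller, which is exactly the regime where the experiments show learning collapses.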
Core Conclusion: Parameter efficiency does not mean fewer parameters are better. Find a balance between efficiency and expressiveness; don't "starve" the model too much.
Ablation Experiments: The Conclusion is Rock-Solid
The ablation experiments conducted by the authors are also insightful:
Batch Size: Unlike SFT, RLVR is not very sensitive to batch size. Small batches (32) are slightly better than large batches (128), but the gap is not large. This may be because the sparse rewards in reinforcement learning do not "overload" the adapter capacity like the dense signals in SFT.
Algorithm Variants: Whether using GRPO, DAPO, or Dr. GRPO, the conclusions are consistent—indicating that the pros and cons of PEFT methods are determined by the fundamental characteristic of "sparse verifiable rewards," not by specific algorithm details.
Learning Rate: Confirmed the optimal learning-rate range among the tested values (1e-5, 5e-6, 1e-6)—too large is unstable, too small is hard to learn.
LoRA Rank: Rank=32 > Rank=16 > Rank=8 >> Rank=1. Don't believe the nonsense that "Rank=1 is enough." Maintaining a moderate rank (16-32) is the way to go.
7B Model Validation: The Conclusion is Scalable
To prove that the findings are not a special phenomenon of the 1.5B small model, the authors repeated the experiments on the 7B model. The results were very consistent:
LoRA: 54.8%
DoRA: 55.0%
LoRA+: 55.5% (best)
MiSS: 53.4%
DoRA and LoRA+ still stably outperform standard LoRA, indicating that the advantages of structural optimization and learning rate adjustment remain effective in large models.
Paper Summary: Pointing a Clear Path for Reinforcement Learning Training
This paper has accomplished a very solid piece of work: It is the first to systematically evaluate the performance of PEFT methods in reinforcement learning. The three findings point us in the right direction:
Stop using standard LoRA; switch to structural variants: DoRA, MiSS, and AdaLoRA are significantly stronger in RLVR scenarios, with DoRA even surpassing full-parameter fine-tuning. If you are still training reinforcement learning models with standard LoRA, it's time to upgrade your toolbox.
Avoid the pitfall of SVD initialization: PiSSA and MiLoRA fail in reinforcement learning because they fundamentally conflict with the "non-principal component update" characteristic of RLVR. If you want to optimize initialization, learn from LoRA+ to adjust learning rates instead of tinkering with SVD decomposition.
Maintain a moderate amount of parameters: Extreme compression (VeRA, IA³, Rank-1) will "starve" the model to the point where it cannot learn. Reinforcement learning's sparse signals require sufficient expressive capacity; don't sacrifice performance to save a bit of memory.
The authors also honestly pointed out future work directions: migrating to higher-performance training frameworks (like VeRL), deeply studying the theoretical mechanisms of adapter dynamics, extending to multimodal and long-term training scenarios, and solving numerical stability issues in weight merging, etc.
Last but not least, this paper provides a "PEFT Selection Guide" for the reinforcement learning community: If you are training models on verifier feedback, such as for math reasoning and code generation, DoRA is the first choice, LoRA+ is the second choice, and standard LoRA is merely "usable but not great." As for SVD initialization and extreme compression methods, steer clear of them. This guide is worth bookmarking for every researcher and engineer working on RLVR!