In recent years, the focus of improving large language model capabilities has been shifting from 'training-time scaling' to 'inference-time scaling'. From Best-of-N and Self-Consistency to more complex search and verification frameworks, Test-Time Scaling has become a crucial paradigm for enhancing the complex reasoning abilities of large models.
However, a long-overlooked issue is that most of these methods assume autoregressive generation.
For Discrete Diffusion Language Models (dLLMs), the situation is entirely different. A dLLM does not generate token by token from left to right; instead, it starts from a masked sequence and gradually recovers the complete answer through a multi-step denoising process. This parallel, non-autoregressive generation has global bidirectional context by construction, which makes it better suited to planning and self-correction. At the same time, tree search, process reward models, and Best-of-N inference designed for autoregressive models cannot be adapted to it directly or efficiently.
To address this issue, the paper proposes PRISM: Pruning, Remasking, and Integrated Self-verification Method, an efficient Test-Time Scaling framework specifically designed for discrete diffusion language models. Its core objective is clear: not to simply have the model 'run a few more times', but to identify more promising trajectories during the denoising process, dynamically prune, create local branches, and use the model itself for lightweight verification, thereby approaching or even surpassing the performance of Best-of-N at a lower inference budget.
Paper Title: Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models
arXiv: https://arxiv.org/abs/2602.01842
Code: https://github.com/viiika/Prism
Traditional Best-of-N is too expensive; PRISM redesigns inference search for dLLMs.
For dLLMs, the cost of naive Best-of-N is easy to state: if you sample N trajectories and each trajectory requires T denoising steps, the total number of model calls is O(NT). Every candidate answer must be run to completion, so trajectories that are clearly of poor quality midway through still consume their full budget.
The key insight of PRISM is to divide the inference process into three stages: early random exploration, mid-stage progressive pruning, and late-stage refinement.
In the high-noise stage, the model's output is still unstable, so PRISM maintains a wide candidate set to preserve diversity. During the early-to-mid denoising window, when the 'logical skeleton' of the answer begins to form, PRISM uses self-verification signals to prune low-quality trajectories and reallocates computational resources to more promising candidates. Finally, only a small number of trajectories are retained to continue the refinement process. The paper calls this process Hierarchical Trajectory Search (HTS).
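The three stages can be pictured as a width schedule over denoising steps plus a top-k pruning step. The sketch below is illustrative only: the function names, the geometric decay, and the step boundaries are assumptions made for this example, not the schedule used in the paper.

```python
def width_at(step, n_init, k_final, prune_start, prune_end):
    """Hypothetical width schedule (an assumption, not the paper's).

    Steps before prune_start keep all n_init trajectories, steps from
    prune_end onward keep k_final, and the window in between decays
    geometrically from n_init towards k_final.
    """
    if step < prune_start:
        return n_init
    if step >= prune_end:
        return k_final
    frac = (step - prune_start + 1) / (prune_end - prune_start + 1)
    return max(k_final, round(n_init * (k_final / n_init) ** frac))


def prune(trajectories, scores, width):
    """Keep the `width` highest-scoring trajectories, preserving input order."""
    ranked = sorted(range(len(trajectories)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:width])
    return [trajectories[i] for i in keep]


# Toy run: 16 candidates shrink to 2 between steps 4 and 8.
widths = [width_at(s, n_init=16, k_final=2, prune_start=4, prune_end=8)
          for s in range(12)]
survivors = prune(["a", "b", "c", "d"], scores=[0.1, 0.9, 0.5, 0.7], width=2)
print(widths, survivors)
```

In a real system the scores would come from the self-verification signal rather than a hard-coded list.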
With this design, the cost of PRISM approaches O(N + KT), where K is the much smaller number of trajectories retained for refinement. Compared to the O(NT) of traditional Best-of-N, this is akin to changing from 'running all routes to the end' to 'broadly exploring first, then concentrating firepower'.
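As a back-of-the-envelope illustration of the two cost regimes, the sketch below counts denoising calls for naive Best-of-N and for a simplified two-stage search. The two-stage model is an assumption: it treats exploration as a fixed number of wide steps followed by K-wide refinement, and it ignores both the intermediate widths of the pruning window and the self-verification calls. All parameter values are made up for the example and are not the paper's measured budgets.

```python
def best_of_n_nfe(n, total_steps):
    """Every one of the n trajectories runs all denoising steps: N * T."""
    return n * total_steps


def two_stage_nfe(n_init, k_final, total_steps, explore_steps):
    """Simplified staged cost: n_init wide for explore_steps, then k_final wide.

    With explore_steps held constant this grows as O(N + K*T).
    """
    return n_init * explore_steps + k_final * (total_steps - explore_steps)


# Hypothetical settings: N = 16, K = 2, T = 256, 8 wide exploration steps.
baseline = best_of_n_nfe(n=16, total_steps=256)  # 16 * 256 = 4096
staged = two_stage_nfe(n_init=16, k_final=2, total_steps=256, explore_steps=8)  # 128 + 496 = 624
print(baseline, staged)
```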
Not starting over, but branching locally on low-confidence tokens.
The second key component of PRISM is Local Branching via Partial Remasking. Intuitively, by the mid-stage of denoising, the model has already formed some high-confidence tokens, which often correspond to the stable structure or logical skeleton of the answer. Meanwhile, low-confidence tokens may correspond to uncertain reasoning details, implementation methods, or local expressions.
PRISM does not crudely discard an entire trajectory and resample; instead, it retains the high-confidence parts, only re-masks the low-confidence positions, and then generates new branches from these local changes. The benefit is that it preserves the existing high-quality structure while continuing to explore different detail implementations, avoiding premature convergence to a single path. Figure 2 in the paper provides an intuitive demonstration of this process: during the progressive pruning stage, PRISM branches locally around high-scoring trajectories and generates new candidates through partial remasking.
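The idea of keeping the confident skeleton and re-masking only uncertain positions can be sketched on a toy token list. Everything here is a stand-in: the mask string, the fixed confidence threshold, and the random `toy_resample` function (which plays the role of the dLLM denoiser) are assumptions for illustration, not the paper's implementation.

```python
import random

MASK = "<mask>"


def partial_remask(tokens, confidences, threshold):
    """Keep tokens whose confidence is at least `threshold`; mask the rest."""
    return [tok if conf >= threshold else MASK
            for tok, conf in zip(tokens, confidences)]


def toy_resample(masked_tokens, vocab, seed):
    """Stand-in for the dLLM denoiser: fill each masked slot at random."""
    rng = random.Random(seed)
    return [rng.choice(vocab) if tok == MASK else tok for tok in masked_tokens]


def local_branches(tokens, confidences, threshold, num_branches, vocab):
    """Create several branches that share the high-confidence skeleton."""
    masked = partial_remask(tokens, confidences, threshold)
    return [toy_resample(masked, vocab, seed=b) for b in range(num_branches)]


tokens = ["x", "=", "3", "+", "4"]
confidences = [0.90, 0.95, 0.40, 0.80, 0.30]
masked = partial_remask(tokens, confidences, threshold=0.5)
branches = local_branches(tokens, confidences, 0.5, num_branches=3,
                          vocab=["1", "2", "3", "4"])
print(masked)
print(branches)
```

Every branch keeps "x", "=", and "+" in place and differs only at the two re-masked positions, which is the property that separates local branching from resampling a whole trajectory.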
No longer relying on external verifiers: the model scores itself.
Many Test-Time Scaling methods rely on an additional reward model or verifier to judge the quality of candidate answers. But this introduces significant system overhead: deployment requires loading an extra model, which increases GPU memory consumption, latency, and engineering complexity.
PRISM proposes Self-Verified Feedback (SVF): directly reusing the same dLLM as a binary classification verifier. Specifically, the model first generates a complete candidate answer based on the intermediate denoising state, then constructs a Yes/No verification prompt, letting the model judge whether the answer is likely correct. PRISM converts the logits of Yes and No into a binary normalized score, used for trajectory ranking, pruning, and final selection.
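One natural reading of "a binary normalized score" is a two-way softmax over the Yes and No logits, and the sketch below takes that reading as an assumption. The prompt template and the helper names are likewise hypothetical; the paper's exact wording and scoring code may differ.

```python
import math


def svf_score(logit_yes, logit_no):
    """Two-way softmax: probability mass on "Yes" among {"Yes", "No"}."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def build_verification_prompt(question, candidate_answer):
    """Hypothetical prompt template; the paper's exact wording may differ."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {candidate_answer}\n"
        "Is the proposed answer correct? Answer Yes or No: "
    )


def rank_candidates(candidates, yes_no_logits):
    """Sort candidates by descending self-verification score."""
    scored = [(svf_score(y, n), c) for c, (y, n) in zip(candidates, yes_no_logits)]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# Made-up (Yes, No) logits for three candidate answers.
ranking = rank_candidates(["A", "B", "C"], [(0.2, 1.0), (2.5, -0.5), (1.0, 1.0)])
print(ranking)
```

Because the score depends only on the difference between the two logits, it can be read from a single forward pass over the verification prompt, which is what keeps this check cheap relative to the denoising steps.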
The value of this design is that it transforms verification from an 'extra model' into a 'lightweight self-check of the same model'. The paper further points out that the number of SVF calls is minimal compared to denoising NFE (Neural Function Evaluations), typically less than 10% of the total NFE in experiments, thus providing an effective search signal at relatively low additional cost.
Experiments: Achieving significant cost-performance improvements in mathematical reasoning and code generation.
The paper evaluates PRISM on four benchmarks: mathematical reasoning tasks GSM8K and MATH-500, and code generation tasks HumanEval and MBPP. The experiments cover three discrete diffusion language models: LLaDA-8B-Instruct, Dream-7B-Instruct, and LLaDA-2.0-mini.
On LLaDA-8B-Instruct, PRISM (K=8) improved GSM8K from 67.58% to 85.30%, and MATH-500 from 26.40% to 42.80%. On code tasks, HumanEval improved by 24.39 percentage points and MBPP by 16.40 percentage points. More importantly, these gains were not achieved by linearly increasing Best-of-N computation: for instance, on GSM8K, PRISM achieved 85.30% with 1048 NFE, whereas Best-of-16 required 4096 NFE to reach 87.50%. That is roughly a 3.9x reduction in denoising computation, in exchange for a gap of about 2.2 accuracy points.
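The headline numbers can be sanity-checked with a few lines of arithmetic. The snippet uses only the figures quoted in this article; the variable names are illustrative.

```python
# Figures quoted in the text for LLaDA-8B-Instruct.
gsm8k_before, gsm8k_after = 67.58, 85.30
math500_before, math500_after = 26.40, 42.80
prism_nfe, best_of_16_nfe = 1048, 4096
best_of_16_acc = 87.50

gsm8k_gain = round(gsm8k_after - gsm8k_before, 2)        # percentage points
math500_gain = round(math500_after - math500_before, 2)  # percentage points
nfe_ratio = round(best_of_16_nfe / prism_nfe, 2)         # Best-of-16 cost relative to PRISM
acc_gap = round(best_of_16_acc - gsm8k_after, 2)         # points by which PRISM trails Best-of-16

print(gsm8k_gain, math500_gain, nfe_ratio, acc_gap)
```

Running it gives a cost ratio of about 3.9 rather than a full 4, and shows that the comparison is not made at equal accuracy.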
In Figure 1 of the paper, PRISM demonstrates a superior performance-computation curve compared to Best-of-N across multiple tasks: at comparable accuracy levels, it shows speed advantages of 2.9×, 6.5×, 1.8×, and 1.7× on GSM8K, MATH500, HumanEval, and MBPP respectively.
The paper also compares PRISM with other inference-time scaling methods. On TruthfulQA, PRISM's ROUGE-1/2/L scores reached 31.8/35.5/31.9 with an inference time of 1048.0 seconds; in contrast, LLaDA-ReMDM scored 29.5/31.8/29.5 with an inference time of 1354.8 seconds. This indicates that PRISM not only improves task performance but also maintains better inference efficiency.
In external verifier comparisons, SVF achieved 85.30% on GSM8K, only requiring the loading of the original 8B model. Although a Qwen3-8B verifier could reach 87.35%, it requires an additional model to be loaded, bringing the total parameter count to 16B. The paper posits that the advantage of SVF is not in absolutely replacing all external verifiers, but in providing a lighter, easier-to-deploy path for dLLM inference scaling.
Significance: Opening an inference-time scaling route for non-autoregressive language models.
The core contribution of PRISM is not simply proposing a new search heuristic, but redefining how Test-Time Scaling should occur on dLLMs.
For autoregressive models, inference search typically revolves around the 'prefix'. But for discrete diffusion models, the intermediate state is a partially masked global sequence, and traditional prefix-based process rewards and tree search are not naturally applicable. PRISM reintegrates search, pruning, local branching, and self-verification back into the denoising dynamics of dLLMs: concentrating budget allocation during the structure formation stage, exploring alternative expressions in low-confidence areas, and performing verification without needing an extra model.
This means that dLLMs are no longer just an alternative paradigm that is 'faster due to parallel generation', but could also become a new type of language model architecture suitable for reasoning, planning, and self-correction. As models like LLaDA, Dream, Mercury, and Gemini Diffusion push discrete diffusion language models to larger scales, PRISM demonstrates an important direction: enabling non-autoregressive models to continuously gain capability improvements through inference-time computation, just like current mainstream LLMs.
From this perspective, PRISM is not just a more computationally efficient alternative to Best-of-N, but a critical puzzle piece in advancing discrete diffusion language models towards efficient reasoning systems.
Author Introduction
This paper was completed by researchers including Jinbin Bai. The author team has long focused on emerging generative paradigms such as discrete diffusion and masked generative modeling, with research directions covering high-resolution text-to-image synthesis, unified multimodal generation, preference alignment and inference optimization for discrete diffusion models, and interactive world models.
Previously, the team proposed Meissonic [1], exploring the potential of masked generative transformers in high-resolution text-to-image synthesis, and subsequently proposed Muddit [2], advancing discrete diffusion modeling from image generation to a more unified multimodal generation framework. PRISM, accepted at ICML 2026, extends this research lineage to the inference stage, focusing on how hierarchical search, self-verification feedback, and local remasking can enable efficient Test-Time Scaling for discrete diffusion models without requiring an external verifier.
[1] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis, ICLR 2025, https://arxiv.org/abs/2410.08261
[2] Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model, ICLR 2026, https://arxiv.org/abs/2505.23606