DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning

Reinforcement learning with verifiers (RLVR) is a central paradigm for improving large language model (LLM) reasoning, yet existing methods often suffer from limited exploration. Policies tend to collapse onto a few reasoning patterns and stop exploring deeply too early, while conventional entropy regularization introduces only local stochasticity and fails to induce meaningful path-level diversity. The result is weak and unstable learning signals in group-based policy optimization. We propose DSDR, a Dual-Scale Diversity Regularization reinforcement learning framework that decomposes diversity in LLM reasoning into global and local components. Globally, DSDR promotes diversity among correct reasoning trajectories to explore distinct solution modes. Locally, it applies a length-invariant, token-level entropy regularization restricted to correct trajectories, preventing entropy collapse within each mode while preserving correctness. The two scales are coupled through a global-to-local allocation mechanism that emphasizes local regularization for more distinctive correct trajectories. We provide theoretical support showing that DSDR preserves optimal correctness under bounded regularization, sustains informative learning signals in group-based optimization, and yields a principled global-to-local coupling rule. Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR. Code is available at https://github.com/SUSTechBruce/DSDR.
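To make the dual-scale idea concrete, the following is a minimal sketch of how a global diversity bonus, a length-invariant local entropy bonus, and a global-to-local allocation might be combined for one sampled group. It is not the authors' implementation: the function name `dual_scale_rewards`, the use of cosine distance between trajectory embeddings as the global diversity measure, the softmax allocation, the coefficients `alpha` and `beta`, and the clipping constant `cap` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def dual_scale_rewards(correct, embeddings, token_entropies,
                       alpha=0.1, beta=0.1, temperature=1.0, cap=0.5):
    """Hypothetical shaped rewards for one group of sampled trajectories.

    correct:         list of bools from the verifier
    embeddings:      one vector per trajectory (assumed to be given)
    token_entropies: per-token policy entropies for each trajectory
    """
    rewards = [1.0 if c else 0.0 for c in correct]
    idx = [i for i, c in enumerate(correct) if c]
    if not idx:
        return rewards  # no correct trajectory, so nothing to regularize

    # Global scale (assumed form): mean cosine distance to the other
    # correct trajectories in the group.
    diversity = {}
    for i in idx:
        others = [j for j in idx if j != i]
        diversity[i] = (
            sum(1.0 - cosine(embeddings[i], embeddings[j]) for j in others)
            / len(others)
            if others else 0.0
        )

    # Local scale: mean token entropy, so the bonus does not grow with length.
    mean_entropy = {
        i: sum(token_entropies[i]) / max(len(token_entropies[i]), 1)
        for i in idx
    }

    # Global-to-local allocation (assumed form): softmax over diversity, so
    # more distinctive correct trajectories receive more local regularization.
    top = max(diversity.values())
    exps = {i: math.exp((diversity[i] - top) / temperature) for i in idx}
    total = sum(exps.values())
    weights = {i: exps[i] / total for i in idx}

    for i in idx:
        bonus = alpha * diversity[i] + beta * weights[i] * mean_entropy[i]
        # Bounded bonus: a correct trajectory always outranks an incorrect one.
        rewards[i] += max(0.0, min(bonus, cap))
    return rewards
```

In this sketch, incorrect trajectories never receive a bonus and the bonus is clipped below the correctness reward, which mirrors the abstract's requirement that regularization stay bounded so correctness is preserved. The shaped rewards would then feed a group-based advantage computation, which is outside the scope of this example.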
