Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated promising gains in enhancing the reasoning capabilities of large language models, but its dependence on domain-specific verifiers significantly restricts its applicability to open and general domains. Recent efforts such as RLPR extend RLVR to general domains, enabling training on broader datasets and yielding improvements over RLVR. A notable limitation of these methods, however, is their tendency to overfit to reference answers, which constrains the model's ability to generate diverse outputs. This limitation is particularly pronounced in open-ended tasks such as writing, where multiple plausible answers exist. To address this, we propose DARL, a simple yet effective reinforcement learning framework that encourages the generation of diverse answers within a controlled deviation range from the reference while preserving alignment with it. Our framework is fully compatible with existing general-domain reinforcement learning methods and can be integrated seamlessly without additional verifiers. Extensive experiments on thirteen benchmarks demonstrate consistent improvements in reasoning performance. Notably, DARL surpasses RLPR with average gains of 1.3 points on six reasoning benchmarks and 9.5 points on seven general benchmarks, highlighting its effectiveness in improving both reasoning accuracy and output diversity.
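The abstract names the core mechanism (rewarding answers that stay aligned with the reference while tolerating a controlled amount of deviation) without specifying its form. As an illustration only, the following is a minimal Python sketch of one way such a deviation-tolerant reward could be shaped, assuming a token-overlap similarity measure and a hand-chosen tolerance band; the names `token_f1`, `deviation_tolerant_reward`, `lower`, and `upper` are hypothetical and do not come from the paper.

```python
# Hypothetical sketch of a deviation-tolerant reward in the spirit of DARL's
# stated idea: full reward for answers inside a similarity band around the
# reference, penalties both for drifting too far and for near-verbatim
# copying. The similarity measure and band edges are illustrative assumptions.
from collections import Counter

def token_f1(answer: str, reference: str) -> float:
    """Token-overlap F1 between an answer and the reference
    (a stand-in similarity; any alignment score could be substituted)."""
    a, r = answer.split(), reference.split()
    if not a or not r:
        return 0.0
    overlap = sum((Counter(a) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def deviation_tolerant_reward(answer: str, reference: str,
                              lower: float = 0.6, upper: float = 0.95) -> float:
    """Full reward inside the similarity band [lower, upper]:
    - below `lower`, reward decays linearly (too far from the reference);
    - above `upper`, reward drops (near-verbatim copying, i.e., no diversity)."""
    s = token_f1(answer, reference)
    if s < lower:
        return s / lower                                 # pull back toward the reference
    if s > upper:
        return 1.0 - 0.5 * (s - upper) / (1.0 - upper)   # discourage copying
    return 1.0                                           # aligned, yet free to vary

# Usage: a paraphrase lands inside the band and earns full reward, while an
# exact copy and an off-topic answer are both scored lower.
ref = "the capital of france is paris"
for ans in [ref, "paris is the capital city of france", "i like turtles"]:
    print(f"{deviation_tolerant_reward(ans, ref):.2f}  <- {ans!r}")
```

Under these assumed band edges, the paraphrase scores 1.0, the verbatim copy 0.5, and the off-topic answer 0.0, matching the intended shape: alignment is preserved while exact reproduction of the reference is no longer the optimum.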