Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@$k$. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
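The core reweighting step described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes cluster assignments have already been produced by the LLM judge, and the function name and signature are hypothetical. It simply scales each rollout's advantage by the inverse of its strategy-cluster size, so a correct solution using a rare strategy keeps more of its credit than one in a crowded cluster.

```python
from collections import Counter


def uniqueness_weighted_advantages(advantages, cluster_ids):
    """Reweight per-rollout advantages inversely with strategy-cluster size.

    advantages  -- list of floats, one advantage per rollout for a problem
    cluster_ids -- list of hashable cluster labels (from an LLM judge),
                   aligned with `advantages`
    Rollouts in smaller clusters (rarer strategies) receive larger weights.
    """
    sizes = Counter(cluster_ids)  # how many rollouts share each strategy
    return [a / sizes[c] for a, c in zip(advantages, cluster_ids)]


# Three correct rollouts with equal raw advantage: two share a strategy,
# one is unique. The unique strategy retains its full advantage.
weighted = uniqueness_weighted_advantages([1.0, 1.0, 1.0], ["A", "A", "B"])
print(weighted)  # [0.5, 0.5, 1.0]
```

Dividing by cluster size is the simplest inverse-frequency scheme; a real implementation might smooth or normalize these weights to keep the overall advantage scale stable across problems.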