Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (prompts for which all rollouts are correct): the group-relative advantage estimates collapse to zero, and the corresponding rollouts are wasted. We introduce a radically simple yet powerful solution to \uline{M}ine \uline{in}trinsic mast\uline{er}y (Miner), which repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, requiring no external supervision, auxiliary models, or additional inference cost. Our method introduces two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) an adaptive advantage calibration scheme that seamlessly integrates intrinsic and verifiable rewards. Evaluated on six reasoning benchmarks with Qwen3-4B and Qwen3-8B base models, Miner outperforms the other four evaluated algorithms, yielding absolute gains of up to \textbf{4.58} in Pass@1 and \textbf{6.66} in Pass@K over GRPO. Comparisons with other exploration-enhancing methods further confirm the advantage of the two proposed components. These results demonstrate that exploiting latent uncertainty is both necessary and sufficient for efficient and scalable RL training of reasoning models.
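To make the two mechanisms concrete, the following is a minimal sketch of how a focal token weighting and an adaptive advantage calibration could be wired together. The abstract does not specify Miner's exact functional forms, so the focal exponent \texttt{gamma}, the surprisal-based intrinsic reward, the homogeneity gate, and all function names below are illustrative assumptions, not the paper's definition.

\begin{verbatim}
# Hypothetical sketch only: functional forms, names, and hyperparameters
# are assumptions; they are not taken from the Miner paper.
import torch

def focal_token_weights(token_logprobs: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Focal-style credit assignment: up-weight uncertain tokens (low
    probability), down-weight overconfident ones (probability near 1)."""
    p = token_logprobs.exp()
    return (1.0 - p).pow(gamma)

def calibrated_advantage(verifiable_adv: torch.Tensor,
                         intrinsic_reward: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """Adaptive calibration (assumed form): fall back on the intrinsic
    signal only when the verifiable advantage is degenerate, e.g. all
    rollouts in the group are correct and its std collapses to zero."""
    intrinsic_adv = (intrinsic_reward - intrinsic_reward.mean()) / (intrinsic_reward.std() + eps)
    gate = (verifiable_adv.std() < eps).float()  # 1 when rewards are homogeneous
    return verifiable_adv + gate * intrinsic_adv

# Example: a group of 4 rollouts, all verified correct (reward 1 each),
# which a plain GRPO update would discard because every advantage is zero.
verifiable_reward = torch.ones(4)
verifiable_adv = verifiable_reward - verifiable_reward.mean()
token_logprobs = torch.log(torch.tensor([[0.9, 0.4], [0.8, 0.7],
                                         [0.95, 0.3], [0.6, 0.5]]))
intrinsic_reward = -token_logprobs.mean(dim=-1)      # mean token surprisal per rollout
adv = calibrated_advantage(verifiable_adv, intrinsic_reward)
weights = focal_token_weights(token_logprobs)        # per-token focal weights
per_token_adv = adv.unsqueeze(-1) * weights          # credit concentrated on uncertain tokens
\end{verbatim}

Under these assumptions, the gate keeps the update identical to the verifiable-reward objective whenever the group rewards disagree, and only on homogeneous groups does the intrinsic, uncertainty-derived signal supply a nonzero, token-focused learning signal.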