Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posteriors of post-trained LRMs exhibit sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by an average of 0.61 and 1.03 percentage points, respectively, across multiple reasoning benchmarks and models. Project page: https://GitHub.com/Xiaomi-Research/LED.
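The depth-conditioned selection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-layer next-token posteriors are already available (e.g. by applying the unembedding and softmax to each intermediate hidden state, logit-lens style), and the function name, shapes, and tie-breaking are illustrative assumptions.

```python
import numpy as np

def led_select(posteriors):
    """Illustrative sketch of LED depth selection.

    posteriors: (num_layers, vocab_size) array, one next-token posterior
    per intermediate layer. All names/shapes here are assumptions for
    illustration, not the paper's exact interface.
    """
    # Aggregate intermediate posteriors via cumulative sum over depth,
    # then renormalize each prefix aggregate into a distribution.
    agg = np.cumsum(posteriors, axis=0)
    agg = agg / agg.sum(axis=1, keepdims=True)

    # Shannon entropy of each depth-aggregated posterior.
    ent = -(agg * np.log(agg + 1e-12)).sum(axis=1)

    # Select the depth configuration with maximal entropy as the
    # exploration candidate; sampling would then use agg[d].
    d = int(np.argmax(ent))
    return d, agg[d]
```

In a toy setting where the final layer is near one-hot (collapsed) while earlier layers remain flat, this selection returns a shallower, higher-entropy aggregate to sample from, which is the exploration behavior the abstract describes.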