Reinforcement Learning (RL) has become a key approach for enhancing the reasoning capabilities of large language models (LLMs). However, prevalent RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) rely on sparse, outcome-based rewards and provide weak exploration incentives, which limits their effectiveness. In particular, sparse rewards offer little feedback, especially on difficult problems, and introduce biases that favor familiar trajectories over novel reasoning paths. These issues critically undermine performance on complex tasks that inherently require iterative reasoning. To overcome these challenges, we propose Intrinsic MotivAtion Guided exploratIoN for Enhanced reasoning (IMAGINE), which delivers dense rewards and encourages exploration. IMAGINE introduces three innovations: a trajectory-aware exploration reward that efficiently reduces token-level bias; an error-conditioned reward allocation scheme that promotes efficient exploration on hard samples while stabilizing training; and an advantage-preserving integration mechanism that retains distributional integrity during learning. Experiments on four public datasets show that IMAGINE improves performance by 22.23% on AIME 2024.
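To make the reward-shaping idea concrete, the following is a minimal, hypothetical sketch rather than the paper's implementation. It assumes a GRPO-style group of rollouts per prompt with a 0/1 outcome reward, and the novelty score, its weight `beta`, and the error-rate scaling are invented for illustration; they loosely mirror the trajectory-aware exploration reward and error-conditioned allocation named in the abstract.

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantage: normalize rewards within the group of
    rollouts sampled for the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def shaped_rewards(outcome_rewards, novelty_scores, beta=0.1):
    """Hypothetical shaping: add a trajectory-level exploration bonus
    (novelty_scores, e.g. dissimilarity to previously sampled trajectories)
    on top of the sparse 0/1 outcome reward. The bonus is scaled by the
    group's error rate, so harder prompts get a stronger exploration signal."""
    outcome = np.asarray(outcome_rewards, dtype=float)
    novelty = np.asarray(novelty_scores, dtype=float)
    error_rate = 1.0 - outcome.mean()  # fraction of failed rollouts in the group
    return outcome + beta * error_rate * novelty

# Example: four rollouts for one hard prompt, only one correct answer.
outcome = [0.0, 1.0, 0.0, 0.0]
novelty = [0.8, 0.2, 0.5, 0.9]
print(group_advantages(shaped_rewards(outcome, novelty)))
```

In this toy setup the dense novelty bonus breaks ties among the failed rollouts, so more exploratory trajectories receive less negative advantage than repetitive ones; how IMAGINE actually computes and integrates its bonus is specified in the method section, not here.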