Reinforcement Learning (RL) with rubric-based rewards has recently shown remarkable progress in enhancing the general reasoning capabilities of Large Language Models (LLMs), yet it still suffers from ineffective exploration that is confined to the current policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, and effective exploration should align its efforts with this desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework that bootstraps effective exploration by explicitly telling LLMs the desired behaviors specified in the rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high-quality samples without repeated trial-and-error from scratch, and theoretically yields a more accurate estimate of the expected gradient. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines and can further benefit from experience-guided self-improvement at test time. Our code is available at https://github.com/sikelifei/HeRL.
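To make the hindsight-guidance idea concrete, the following is a minimal Python sketch of how a failed response and its unmet rubrics might be packed into an in-context prompt for a second attempt. It is an illustration under stated assumptions, not the implementation in the HeRL repository: the names `RubricResult`, `rubric_reward`, `build_hindsight_prompt`, and `improvement_bonus`, the prompt wording, the fraction-satisfied reward, and the clipped-difference form of the bonus with weight `beta` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one rubric criterion against a response (hypothetical type)."""
    criterion: str
    satisfied: bool


def rubric_reward(results: List[RubricResult]) -> float:
    """Assumed reward: the fraction of rubric criteria that the response satisfies."""
    if not results:
        return 0.0
    return sum(r.satisfied for r in results) / len(results)


def unmet_rubrics(results: List[RubricResult]) -> List[str]:
    """Collect the criteria that the response failed to meet."""
    return [r.criterion for r in results if not r.satisfied]


def build_hindsight_prompt(question: str, failed_response: str,
                           results: List[RubricResult]) -> str:
    """Pack a failed response and its unmet rubrics into in-context guidance.

    The prompt wording is an assumption made for this sketch.
    """
    missing = "\n".join(f"- {c}" for c in unmet_rubrics(results))
    return (
        f"Question:\n{question}\n\n"
        f"Previous attempt:\n{failed_response}\n\n"
        f"The previous attempt did not satisfy these criteria:\n{missing}\n\n"
        "Write an improved answer that satisfies all of them."
    )


def improvement_bonus(original_reward: float, guided_reward: float,
                      beta: float = 0.1) -> float:
    """Assumed bonus: a scaled, non-negative reward gain under guidance.

    This is one possible form chosen for illustration, not the paper's formula.
    """
    return beta * max(0.0, guided_reward - original_reward)
```

In this sketch, a response that meets one of two criteria receives a reward of 0.5, and the guided prompt lists only the criterion it missed. How the guided responses and the bonus enter the policy update is left out, since the abstract does not specify it.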