Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, both of which directly hinder learning on difficult samples during large language model (LLM) post-training. In this paper, we introduce Batch Adaptation Policy Optimization (BAPO), an off-policy RLVR framework that improves data efficiency in LLM post-training. BAPO dynamically selects training batches by re-evaluating historically difficult samples and reusing high-quality ones, while preserving a lower-bound guarantee on policy improvement. Extensive experiments show that BAPO achieves an average improvement of 12.5% over GRPO across mathematics, planning, and visual reasoning tasks. Crucially, BAPO solves 40.7% of the problems that base models consistently fail to solve.
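To make the two failure modes concrete, the following is a minimal illustrative sketch, not the BAPO algorithm itself. It assumes binary verifiable rewards and GRPO-style group-normalized advantages, under which a group of responses with identical rewards yields all-zero advantages (reward homogeneity) and the rollouts spent on it carry no learning signal (experience waste). The names `Group`, `select_batch`, and the selection rule are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Group:
    """Rollouts for one prompt: one verifiable reward (0.0 or 1.0) per response."""
    prompt_id: str
    rewards: list


def group_advantages(rewards):
    """Group-normalized advantages; all zeros when every reward is identical."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def is_homogeneous(rewards):
    """True when the group carries no relative signal (all rewards equal)."""
    return len(set(rewards)) <= 1


def select_batch(fresh, buffer, batch_size):
    """Hypothetical selection rule (an assumption, not taken from the paper).

    1. Keep fresh groups whose rewards are mixed.
    2. Queue prompts whose fresh responses all failed for later re-evaluation.
    3. Fill the remaining slots with stored mixed-reward groups from the buffer.
    """
    batch = [g for g in fresh if not is_homogeneous(g.rewards)]
    retry = [g.prompt_id for g in fresh
             if is_homogeneous(g.rewards) and max(g.rewards) == 0.0]
    for g in buffer:
        if len(batch) >= batch_size:
            break
        if not is_homogeneous(g.rewards):
            batch.append(g)
    return batch[:batch_size], retry


fresh = [Group("a", [0.0, 0.0]), Group("b", [1.0, 0.0]), Group("c", [1.0, 1.0])]
buffer = [Group("d", [0.0, 1.0]), Group("e", [1.0, 1.0])]
batch, retry = select_batch(fresh, buffer, batch_size=2)
print([g.prompt_id for g in batch], retry)
```

In this sketch, stored groups are reused only to fill slots that fresh rollouts leave empty. Any off-policy correction needed when training on stored rollouts is omitted here.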