Reinforcement learning with verifiable rewards (RLVR) has recently delivered significant gains in the reasoning capability of language models (LMs). However, existing RLVR approaches train LMs only on their own on-policy responses and are therefore constrained by the models' initial capability, leaving them prone to exploration stagnation, a state in which LMs fail to solve additional training problems and can no longer learn from the training data. Some work addresses this by leveraging off-policy solutions to the training problems, but it relies on external expert guidance that is limited in availability and scalability. In this work, we propose LTE (Learning to reason from Trial and Error), an approach that hints LMs with their own previously made mistakes and requires no external expert guidance. Experiments validate the effectiveness of LTE: on Qwen3-8B-Base it outperforms standard group relative policy optimization (GRPO) by 5.02 in Pass@1 and 9.96 in Pass@k on average across six mathematical reasoning benchmarks, and it even surpasses methods that require external gold solutions as guidance once the experimental setups are aligned. Further analysis confirms that LTE successfully mitigates exploration stagnation and enhances both exploitation and exploration during training. Our code is available at https://anonymous.4open.science/r/Learning-from-Trial-and-Error.
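To make the trial-and-error hinting concrete, the sketch below gives one plausible reading of the core loop under a GRPO-style setup: when every rollout in a group fails the verifier, one of the model's own incorrect attempts is stored and appended to the prompt for later rollouts on the same problem. This is a minimal sketch, not the authors' released implementation; names such as generate, verify, and build_hinted_prompt are illustrative placeholders.

    # Minimal sketch of trial-and-error hinting, assuming a GRPO-style rollout
    # loop with a binary verifiable reward; all names are illustrative.
    from dataclasses import dataclass, field

    @dataclass
    class Problem:
        question: str
        gold_answer: str
        failed_attempts: list = field(default_factory=list)  # self-made mistakes

    def build_hinted_prompt(problem: Problem) -> str:
        """Compose the rollout prompt; include the latest failed attempt as a hint."""
        prompt = f"Problem: {problem.question}\n"
        if problem.failed_attempts:
            prompt += ("The attempt below turned out to be wrong; find the mistake "
                       f"and try a different approach:\n{problem.failed_attempts[-1]}\n")
        return prompt + "Solution:"

    def collect_group(problem: Problem, generate, verify, group_size: int = 8):
        """Sample a group of rollouts and record verifiable (0/1) rewards."""
        prompt = build_hinted_prompt(problem)
        rollouts = [generate(prompt) for _ in range(group_size)]
        rewards = [float(verify(r, problem.gold_answer)) for r in rollouts]
        if max(rewards) == 0.0:
            # All rollouts failed: keep one mistake to hint with in later passes,
            # rather than letting the zero-advantage group stall exploration.
            problem.failed_attempts.append(rollouts[0])
        return prompt, rollouts, rewards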