We propose a way to optimize chain-of-thought with reinforcement learning, but without an external reward function. Our algorithm relies on viewing the chain-of-thought as a latent variable in a probabilistic inference problem. In contrast to the full evidence lower bound, we propose to apply a much simpler Jensen's lower bound, which yields tractable objectives with simple algorithmic components (e.g., without the need for a parametric approximate posterior), making it more conducive to modern large-scale training. The lower-bound approach naturally interpolates between other methods such as supervised fine-tuning and online reinforcement learning, whose practical trade-offs we illustrate. Finally, we show that on mathematical reasoning problems, optimizing with Jensen's lower bound is as effective as policy gradient with an external reward. Taken together, our results serve as a proof of concept for this new algorithmic paradigm's potential in more general applications.
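For concreteness, here is a minimal sketch of the bound in question (the notation is ours, not taken from the paper: $x$ denotes the prompt, $z$ the chain-of-thought, and $y$ the answer). Applying Jensen's inequality to the log marginal likelihood of the answer under the latent-variable model gives

\[
\log p_\theta(y \mid x)
\;=\; \log \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\bigl[\, p_\theta(y \mid x, z) \,\bigr]
\;\ge\; \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}\bigl[\, \log p_\theta(y \mid x, z) \,\bigr].
\]

Because the expectation is taken under the model's own chain-of-thought distribution $p_\theta(z \mid x)$ rather than a separately learned approximate posterior, the bound can be estimated by sampling chains-of-thought from the model itself, with the log-likelihood of the correct answer effectively playing the role of an internal reward.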