Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing multiple steps before receiving any reward. Properly assigning credit to these steps is essential for improving model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning (RL) algorithm for LLM finetuning, employs value networks to tackle credit assignment. However, value networks struggle to predict expected cumulative rewards accurately in complex reasoning tasks, often leading to high-variance updates and suboptimal performance. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they barely outperform a random baseline when comparing alternative steps. To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based value estimates, bypassing the need for large value networks. Our method consistently outperforms PPO and other RL-free baselines on the MATH and GSM8K datasets, with up to 9x fewer gradient updates and up to 3.0x less wall-clock time. These results underscore the importance of accurate credit assignment in RL finetuning of LLMs and demonstrate VinePPO's potential as a superior alternative.
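The core idea behind the Monte Carlo-based value estimation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `sample_completion` and `reward` are hypothetical stand-ins for the policy LLM's sampler and the task's terminal reward function, and the rollout count is arbitrary. Because a language environment can be "reset" to any intermediate state simply by re-prompting with the partial response, the value of that state can be estimated by averaging the rewards of independent rollouts, with no learned value network involved.

```python
def estimate_value(prefix, sample_completion, reward, num_rollouts=9):
    """Unbiased Monte Carlo value estimate for an intermediate state.

    Rather than querying a learned value network, draw `num_rollouts`
    independent completions of the partial response `prefix` and
    average their final rewards. `sample_completion` and `reward` are
    hypothetical placeholders for the policy's sampler and the task's
    reward function (e.g. answer correctness on MATH or GSM8K).
    """
    returns = [reward(prefix + sample_completion(prefix))
               for _ in range(num_rollouts)]
    return sum(returns) / len(returns)


# Toy usage with a stubbed environment: reward is 1.0 when the
# completed text ends with the correct answer.
value = estimate_value(
    "Q: What is 2+2?\nA:",
    sample_completion=lambda p: " The answer is 4",
    reward=lambda text: 1.0 if text.endswith("4") else 0.0,
    num_rollouts=4,
)
```

The estimate is unbiased because each rollout is an independent sample of the return from that state; its variance shrinks as the number of rollouts grows, trading extra inference compute for accuracy.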