In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards -- a challenging scenario in traditional deep reinforcement learning. Despite the great success of PPO in aligning state-of-the-art closed-source large language models (LLMs), its open-source implementations remain largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to find a near-optimal policy in a sample-efficient manner. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence-level rewards, surprisingly provides a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive real-world alignment experiments verify the effectiveness of the proposed approach.
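The DPO-to-PPO bridge mentioned above admits a minimal sketch. Under the assumption (not spelled out in this abstract) that the DPO-trained policy's implicit token-wise reward is $\beta$ times the per-token log-probability ratio against the reference policy, a dense reward sequence for PPO can be computed as below; the function name and the value of $\beta$ are illustrative, not the paper's implementation.

```python
def token_rewards(dpo_logprobs, ref_logprobs, beta=0.1):
    """Hypothetical token-wise reward from a DPO policy vs. a reference policy.

    Each argument is a list of per-token log-probabilities for the same
    response. Each token t receives beta * (log pi_dpo - log pi_ref),
    turning the sparse sentence-level signal into a dense one for PPO.
    """
    return [beta * (lp - ref) for lp, ref in zip(dpo_logprobs, ref_logprobs)]

# Example: a 3-token response (log-probabilities are made up for illustration)
dpo_lp = [-0.5, -1.2, -0.3]
ref_lp = [-0.9, -1.0, -0.8]
print(token_rewards(dpo_lp, ref_lp))  # one reward per token
```

Tokens the DPO policy prefers more strongly than the reference receive positive reward, and vice versa, so every position in the response contributes a learning signal rather than only the final token.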