A centerpiece of the ever-popular reinforcement learning from human feedback (RLHF) approach to fine-tuning autoregressive language models is the explicit training of a reward model to emulate human feedback, distinct from the language model itself. This reward model is then coupled with policy-gradient methods to dramatically improve the alignment between language model outputs and desired responses. In this work, we adopt a novel perspective wherein a pre-trained language model is itself simultaneously a policy, reward function, and transition function. An immediate consequence is that reward learning and language model fine-tuning can be performed jointly and directly, without requiring any further downstream policy optimization. While this perspective does break the traditional agent-environment interface, we nevertheless maintain that bringing to bear traditional algorithmic concepts from reinforcement learning can afford enormous statistical benefits. Our experiments demonstrate one concrete instance of this through efficient exploration based on the representation and resolution of epistemic uncertainty. To illustrate these ideas transparently, we restrict attention to a simple didactic data-generating process and leave the extension to systems of practical scale for future work.