Aligning language models (LMs) with preferences is an important problem in natural language generation. A key challenge is that preferences are typically provided at the sequence level, while LM training and generation both occur at the token level. There is therefore a granularity mismatch between the preference and the LM training losses, which may complicate the learning problem. In this paper, we address this issue by developing an alternating training process, in which we iterate between grounding the sequence-level preference into token-level training guidance and improving the LM with the learned guidance. For guidance learning, we design a framework that extends pairwise-preference learning from imitation learning in two ways: to variable-length LM generation, and to preferences among multiple generations. For LM training, we present two minimalist learning objectives that utilize the learned guidance, chosen according to the amount of supervised data available. In experiments, our method performs competitively on two distinct, representative LM tasks: discrete-prompt generation and text summarization.
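To make the guidance-learning idea concrete, the sketch below shows one way a sequence-level preference over several generations could be turned into a loss on token-level scores. It is an illustration under stated assumptions, not the method of the paper: the abstract does not specify the aggregation function or the loss, so the mean aggregation, the Plackett-Luce style listwise loss, and the names `sequence_score` and `listwise_preference_loss` are all hypothetical.

```python
import math
from typing import List, Sequence


def sequence_score(token_rewards: Sequence[float]) -> float:
    """Aggregate per-token guidance into one sequence-level score.

    Assumption: mean aggregation, chosen here so that sequences of
    different lengths remain comparable. The paper may use another rule.
    """
    if not token_rewards:
        raise ValueError("a sequence needs at least one token reward")
    return sum(token_rewards) / len(token_rewards)


def _logsumexp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty sequence."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def listwise_preference_loss(ranked_token_rewards: List[Sequence[float]]) -> float:
    """Negative log-likelihood of a ranking under a Plackett-Luce model.

    `ranked_token_rewards` holds the token-level rewards of each
    generation, ordered from most preferred to least preferred.
    With exactly two generations this reduces to the usual pairwise
    (Bradley-Terry) preference loss.
    """
    scores = [sequence_score(rewards) for rewards in ranked_token_rewards]
    loss = 0.0
    # At each rank, the preferred item competes against all items ranked below it.
    for i in range(len(scores) - 1):
        loss += _logsumexp(scores[i:]) - scores[i]
    return loss


if __name__ == "__main__":
    preferred = [0.9, 0.8, 0.7]   # three tokens
    rejected = [0.1, 0.2]         # two tokens, a different length
    print(listwise_preference_loss([preferred, rejected]))
    print(listwise_preference_loss([rejected, preferred]))
```

In this sketch the loss is lower when the token-level scores agree with the given ranking, so minimizing it with respect to a learned token-level reward function would push sequence-level preference information down to individual tokens.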