Large language models distill broad knowledge from text corpora. However, they can be inconsistent when completing user-specified tasks. This issue can be addressed by finetuning such models via supervised learning on curated datasets, or via reinforcement learning. In this work, we propose a novel offline RL method, implicit language Q-learning (ILQL), designed for use on language models. ILQL combines the flexible utility maximization framework of RL algorithms with the ability of supervised learning to leverage previously collected data, while retaining the simplicity and stability of supervised learning. Our method combines value conservatism with an implicit dataset support constraint when learning value functions, which are then used to guide language model generations towards maximizing user-specified utility functions. In addition to empirically validating ILQL, we present a detailed empirical analysis of situations where offline RL can be useful in natural language generation settings. We demonstrate that it can be a more effective utility optimizer than prior approaches for end-to-end dialogue, and that it can effectively optimize high-variance reward functions based on subjective judgment, such as whether to label a comment as toxic.
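
To make concrete the idea of value functions guiding generation, the following is a minimal sketch and not the method's actual implementation. It assumes, as one plausible reading of the text above, that the next-token logits of a language model are shifted by a scaled advantage Q(s, a) - V(s) before sampling. The function names, the scaling parameter `beta`, and the toy numbers are all hypothetical and do not come from this abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def value_guided_probs(lm_logits, q_values, v_value, beta=1.0):
    """Reweight next-token probabilities by a scaled advantage.

    Assumption (not stated in the abstract): guidance is applied by adding
    beta * (Q(s, a) - V(s)) to each token's logit. With beta = 0 the
    language model's own distribution is recovered unchanged.
    """
    if len(lm_logits) != len(q_values):
        raise ValueError("lm_logits and q_values must have the same length")
    shifted = [
        logit + beta * (q - v_value)
        for logit, q in zip(lm_logits, q_values)
    ]
    return softmax(shifted)


# Toy example with a hypothetical 3-token vocabulary.
lm_logits = [2.0, 1.0, 0.0]   # the language model prefers token 0
q_values = [0.1, 0.9, 0.2]    # the learned Q-function prefers token 1
v_value = 0.4                 # learned state value V(s)

base = softmax(lm_logits)
guided = value_guided_probs(lm_logits, q_values, v_value, beta=4.0)
```

In this toy setting, raising `beta` moves probability mass towards the token with the highest estimated advantage, while `beta = 0` leaves the language model's distribution untouched. The appropriate trade-off between the two is a design choice this sketch does not attempt to settle.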