As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate text aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, in which each sample consists of a text and an associated numerical outcome measuring the reader's response. We first argue that language model optimization should be viewed as a causal problem, so that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem and develop a method, causal preference optimization (CPO), that solves an unbiased surrogate objective for the problem. We further extend CPO to doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.
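The abstract does not state the surrogate objective itself, so the following is only a generic sketch of the two estimator families that the terms "unbiased surrogate" and "doubly robust" usually refer to: an importance-weighted estimate of the expected outcome, and a doubly robust variant that adds an outcome-model baseline. It is not the paper's CPO or DR-CPO objective. All names here (`ipw_value`, `dr_value`, `weights`, `predicted_logged`, `predicted_under_policy`) are hypothetical, and the importance weights are assumed to be the ratio of the candidate model's probability of each logged text to the probability under which that text was collected.

```python
from statistics import fmean
from typing import Sequence


def ipw_value(weights: Sequence[float], outcomes: Sequence[float]) -> float:
    """Importance-weighted estimate of the expected outcome under a candidate model.

    weights[i]  : assumed ratio (candidate prob. of text i) / (logging prob. of text i)
    outcomes[i] : observed numerical outcome for logged text i
    """
    return fmean([w * y for w, y in zip(weights, outcomes)])


def dr_value(
    weights: Sequence[float],
    outcomes: Sequence[float],
    predicted_logged: Sequence[float],
    predicted_under_policy: float,
) -> float:
    """Generic doubly robust estimate (hypothetical helper, not the paper's DR-CPO).

    predicted_logged[i]    : outcome-model prediction for logged text i
    predicted_under_policy : outcome-model prediction averaged over texts
                             drawn from the candidate model
    The estimate is the model-based baseline plus an importance-weighted
    correction computed on the residuals.
    """
    correction = fmean(
        [w * (y - g) for w, y, g in zip(weights, outcomes, predicted_logged)]
    )
    return predicted_under_policy + correction
```

Two properties of this sketch are worth noting. With an all-zero outcome model, `dr_value` reduces to `ipw_value`. With an outcome model that matches the observed outcomes exactly, the correction term vanishes and the importance weights no longer contribute any variance.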