As large language models (LLMs) see greater use in academic and commercial settings, there is increasing interest in methods that allow language models to generate texts aligned with human preferences. In this paper, we present an initial exploration of language model optimization for human preferences from direct outcome datasets, where each sample consists of a text and an associated numerical outcome measuring the reader's response. We first propose that language model optimization should be viewed as a causal problem to ensure that the model correctly learns the relationship between the text and the outcome. We formalize this causal language optimization problem, and we develop a method--causal preference optimization (CPO)--that solves an unbiased surrogate objective for the problem. We further extend CPO with doubly robust CPO (DR-CPO), which reduces the variance of the surrogate objective while retaining provably strong guarantees on bias. Finally, we empirically demonstrate the effectiveness of (DR-)CPO in optimizing state-of-the-art LLMs for human preferences on direct outcome data, and we validate the robustness of DR-CPO under difficult confounding conditions.
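To make the doubly robust idea concrete, the following is a minimal sketch of a standard doubly robust (AIPW-style) off-policy value estimate; the notation here ($\pi_0$, $\hat{\mu}$, and the logged pairs $(x_i, y_i)$) is an illustrative assumption, not taken from the paper, and the paper's exact surrogate objective may differ. Given texts $x_i$ logged from a known behavior policy $\pi_0$ with observed outcomes $y_i$, and an outcome model $\hat{\mu}(x) \approx \mathbb{E}[Y \mid x]$, the value of a candidate policy $\pi_\theta$ can be estimated as

\[
\hat{V}_{\mathrm{DR}}(\pi_\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\!\left[\hat{\mu}(x)\right] \;+\; \frac{1}{n}\sum_{i=1}^{n} \frac{\pi_\theta(x_i)}{\pi_0(x_i)} \bigl(y_i - \hat{\mu}(x_i)\bigr).
\]

The estimator is unbiased if either the importance weights or the outcome model is correct, and an accurate $\hat{\mu}$ shrinks the residuals $y_i - \hat{\mu}(x_i)$, reducing variance relative to pure importance weighting; this mirrors the bias-variance trade-off that DR-CPO is described as making.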