Large language models (LLMs) are increasingly deployed via public-facing interfaces to interact with millions of users, each with diverse preferences. Despite this, preference tuning of LLMs predominantly relies on reward models trained using binary judgments, where annotators select the preferred choice out of a pair of model outputs. In this work, we argue that this reliance on binary choices does not capture the broader, aggregate preferences of the target user in real-world tasks. We propose a taxonomy that identifies two dimensions of subjectivity where different users disagree on the preferred output: the Plurality of Responses to Prompts, where prompts allow for multiple correct answers, and the Indistinguishability of Responses, where candidate outputs are paraphrases of each other. We show that reward models correlate only weakly with user preferences in these cases. As a first step to address this issue, we introduce a simple yet effective method that augments existing binary preference datasets with synthetic preference judgments to estimate potential user disagreement. Incorporating these judgments via a margin term, as a form of regularization during reward model training, yields predictions that better align with aggregate user preferences.
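To make the margin term concrete, the sketch below shows one common way such a term can enter a Bradley-Terry style pairwise reward-model loss (the Llama-2-style formulation). The function name `margin_preference_loss` and the mapping from disagreement estimates to margin values are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def margin_preference_loss(chosen_rewards: torch.Tensor,
                           rejected_rewards: torch.Tensor,
                           margins: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss with a per-pair margin term:
        L = -log sigmoid(r(x, y_w) - r(x, y_l) - m).
    A large margin (strong estimated user agreement) pushes the reward
    model to separate the pair widely; a margin near zero (high estimated
    disagreement) relaxes the constraint, regularizing ambiguous pairs.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards - margins).mean()

# Illustrative usage: rewards for three preference pairs, with synthetic
# disagreement estimates already mapped to margins in [0, 1].
chosen = torch.tensor([1.2, 0.3, 0.9])
rejected = torch.tensor([0.4, 0.1, 0.8])
margins = torch.tensor([1.0, 0.2, 0.0])  # 0.0 = users likely disagree
loss = margin_preference_loss(chosen, rejected, margins)
print(loss.item())
```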