Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits on aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a general probabilistic preference model called the Luce model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback. Next, we explore the conditions under which LLMs would employ mixed strategies -- meaning they do not collapse to a single response -- when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback. We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. As a blessing, we prove that this condition holds with high probability under the Luce model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization when aligning LLMs.
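For intuition, a minimal worked example (not part of the abstract; the three responses $y_1, y_2, y_3$ and the Bradley--Terry parameterization commonly used in RLHF are illustrative assumptions) shows why a Condorcet cycle rules out a reward representation. Suppose a majority prefers $y_1$ to $y_2$, $y_2$ to $y_3$, and $y_3$ to $y_1$:
\[
\mathbb{P}(y_1 \succ y_2) > \tfrac{1}{2}, \qquad
\mathbb{P}(y_2 \succ y_3) > \tfrac{1}{2}, \qquad
\mathbb{P}(y_3 \succ y_1) > \tfrac{1}{2}.
\]
Under a Bradley--Terry reward model, $\mathbb{P}(y \succ y') = \sigma\bigl(r(y) - r(y')\bigr)$ with $\sigma$ the logistic function, so each probability exceeds $\tfrac{1}{2}$ exactly when the corresponding reward gap is positive. The cycle would then force
\[
r(y_1) > r(y_2) > r(y_3) > r(y_1),
\]
a contradiction, so no reward function $r$ can represent such cyclic preferences.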