Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical to the safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivity to minor changes in the preferences they model. Our findings reveal that, in the Bradley-Terry and Plackett-Luce models, the probability of a preference can change significantly as other preferences change, especially when those preferences are dominant (i.e., have probabilities near 0 or 1). We identify the specific conditions under which this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems.
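To make the kind of sensitivity studied here concrete, the sketch below uses the standard Bradley-Terry parameterization, P(i ≻ j) = σ(s_i − s_j), under which the model's prediction for one pair is fully determined by its predictions for overlapping pairs (logits add along a chain). The three-item example, the specific probability values, and the helper functions are hypothetical illustrations, not the paper's experimental setup; they only show how a small change in a dominant preference (probability near 0 or 1) can shift another predicted preference substantially.

```python
import numpy as np

def logit(p):
    """Map a preference probability to a Bradley-Terry score difference (inverse sigmoid)."""
    return np.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def implied_pref_ac(p_ab, p_bc):
    """Under Bradley-Terry, P(A > C) is determined by P(A > B) and P(B > C):
    logit P(A > C) = (s_A - s_B) + (s_B - s_C) = logit P(A > B) + logit P(B > C)."""
    return sigmoid(logit(p_ab) + logit(p_bc))

# Two observed preferences, one of them dominant (probability near 1).
p_ab = 0.01             # B is strongly preferred to A
p_bc = 0.99             # B is strongly preferred to C
p_bc_perturbed = 0.995  # a small (0.005) change in the dominant preference

print(implied_pref_ac(p_ab, p_bc))            # ~0.500
print(implied_pref_ac(p_ab, p_bc_perturbed))  # ~0.668: a large shift in the predicted P(A > C)
```

The effect comes from the logit transform: its derivative 1/(p(1 − p)) grows without bound as p approaches 0 or 1, so near-deterministic preferences translate tiny probability changes into large score changes that propagate to other pairs.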