The emergent capabilities of Large Language Models (LLMs) have made it crucial to align their values with those of humans. Current methodologies typically attempt alignment with a single, homogeneous human value and require human verification, yet there is no consensus on which aspects of alignment are desired or how deep alignment should go, and human verification introduces biases of its own. In this paper, we propose A2EHV, an Automated Alignment Evaluation with a Heterogeneous Value system that (1) is automated to minimize individual human biases, and (2) allows assessments against various target values to foster heterogeneous agents. Our approach pivots on the concept of value rationality, which represents the ability of agents to execute the behaviors that best satisfy a target value. The quantification of value rationality is facilitated by the Social Value Orientation framework from social psychology, which partitions the value space into four categories to assess social preferences from agents' behaviors. We evaluate the value rationality of eight mainstream LLMs and observe that large models are more inclined to align with neutral values than with strong personal values. By examining the behavior of these LLMs, we contribute to a deeper understanding of value alignment within a heterogeneous value system.
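To make the four-way partition of the value space concrete, the following is a minimal sketch of how a Social Value Orientation category could be read off an agent's allocation choices. It assumes the angle-based scoring of the SVO Slider Measure (Murphy, Ackermann, and Handgraaf, 2011), where payoffs are centered at 50 and the categories are altruistic, prosocial, individualistic, and competitive. The abstract does not state that A2EHV uses this exact scoring, and the function names `svo_angle` and `svo_category` are hypothetical.

```python
import math

# Angle boundaries in degrees, taken from the SVO Slider Measure.
# Assumption: the four categories mentioned in the abstract correspond
# to these standard boundaries; the paper may define them differently.
ALTRUISTIC_MIN = 57.15
PROSOCIAL_MIN = 22.45
INDIVIDUALISTIC_MIN = -12.04


def svo_angle(self_payoffs, other_payoffs, center=50.0):
    """Return the SVO angle in degrees for a set of allocation choices.

    self_payoffs[i] and other_payoffs[i] are the amounts the agent gave
    to itself and to the other party in choice i.
    """
    if not self_payoffs or len(self_payoffs) != len(other_payoffs):
        raise ValueError("payoff lists must be non-empty and equal in length")
    mean_self = sum(self_payoffs) / len(self_payoffs)
    mean_other = sum(other_payoffs) / len(other_payoffs)
    # atan2 keeps the correct quadrant even when mean_self == center.
    return math.degrees(math.atan2(mean_other - center, mean_self - center))


def svo_category(angle):
    """Map an SVO angle in degrees to one of the four categories."""
    if angle > ALTRUISTIC_MIN:
        return "altruistic"
    if angle > PROSOCIAL_MIN:
        return "prosocial"
    if angle > INDIVIDUALISTIC_MIN:
        return "individualistic"
    return "competitive"


if __name__ == "__main__":
    # An agent that always splits payoffs equally at 85 each.
    angle = svo_angle([85, 85, 85], [85, 85, 85])
    print(svo_category(angle))
```

Under these assumptions, an evaluation could compare the category inferred from an LLM's choices against the target value it was asked to pursue; how A2EHV turns that comparison into a value rationality score is not specified in the abstract.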