Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While essential, this focus may overlook deeper latent structures that characterise real populations and underpin cultural values theories. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We demonstrate the value of our evaluation scheme by comparing two model steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model approximates marginal response distributions better than persona prompting, both techniques fail to fully capture the gold-standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment and that an evaluation focused on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.
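The distinction the abstract draws can be made concrete with a minimal sketch on synthetic data. The example below is an illustrative assumption, not the paper's actual methodology: it uses total-variation distance for the marginal comparison and a Frobenius gap between correlation matrices for the structural comparison, and it fakes a "model" that resamples each item independently, so that marginals match while cross-item structure is destroyed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 1000 "human" respondents on 3 Likert items (1-5).
# Items 0 and 1 share a latent factor, so they are correlated.
latent = rng.normal(size=(1000, 1))
human = np.clip(np.round(
    2.5 + latent @ np.array([[1.0, 1.0, 0.0]]) + rng.normal(size=(1000, 3))
), 1, 5)

# A "model" that matches each item's marginal by resampling items
# independently, which destroys the cross-item correlation structure.
model = np.column_stack(
    [rng.choice(human[:, j], size=1000) for j in range(3)]
)

def marginal_tv(a, b, levels=range(1, 6)):
    """Mean total-variation distance between per-item response distributions."""
    tv = []
    for j in range(a.shape[1]):
        pa = np.array([np.mean(a[:, j] == lvl) for lvl in levels])
        pb = np.array([np.mean(b[:, j] == lvl) for lvl in levels])
        tv.append(0.5 * np.abs(pa - pb).sum())
    return float(np.mean(tv))

def correlation_gap(a, b):
    """Frobenius norm between the two item-item correlation matrices."""
    return float(np.linalg.norm(np.corrcoef(a.T) - np.corrcoef(b.T)))

print(marginal_tv(human, model))      # small: marginals match well
print(correlation_gap(human, model))  # large: correlations not preserved
```

A marginals-only evaluation would score this synthetic "model" highly, while the correlation-based check exposes the structural failure — exactly the masking effect the abstract warns about.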