Large language models are increasingly used to represent human opinions, values, or beliefs, and their steerability towards these ideals is an active area of research. Existing work focuses predominantly on aligning marginal response distributions, treating each survey item independently. While marginal alignment is essential, it may overlook deeper latent structures that characterise real populations and underpin theories of cultural values. We propose a framework for evaluating the representativeness of aligned models through multivariate correlation patterns in addition to marginal distributions. We demonstrate the value of our evaluation scheme by comparing two model-steering techniques (persona prompting and demographic fine-tuning) and evaluating them against human responses from the World Values Survey. While the demographically fine-tuned model approximates marginal response distributions better than persona prompting, both techniques fail to fully capture the gold-standard correlation patterns. We conclude that representativeness is a distinct aspect of value alignment, and that an evaluation focused solely on marginals can mask structural failures, leading to overly optimistic conclusions about model capabilities.