Evaluating AI Alignment in LLMs: Output Analysis of Value Priorities Across 75 Models with Human Benchmarking

Large language models (LLMs) are increasingly used in human-AI interaction research and practice, yet existing capability and safety benchmarks reveal little about the value priorities these systems express or how those priorities correspond to human judgements. Across three studies, we introduce an output-based approach to evaluating one facet of AI alignment by treating LLM-generated text as behavioural data and comparing expressed value-priority profiles with a human reference. Study 1 used inductive qualitative analysis to derive six themes of optimal AI functioning, namely Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, and Agency. Study 2 showed that LLM outputs were highly stable within models and converged on a common value-priority structure across models, indicating reliable and comparable value profiles. Study 3 benchmarked 75 contemporary LLMs against 376 human respondents using a profile-fidelity metric capturing both the relative ordering of priorities and the calibration of between-priority differences. Although most models reproduced the human ordering of values, some systematically exaggerated the differences between them, showing that models can appear aligned on conventional benchmarks while still diverging from human value calibration. Profile fidelity varied substantially across models and did not consistently scale with size, recency, or capability tier. Both LLMs and humans converged on a deprioritisation of Agency, raising important questions about the development of increasingly agentic AI systems. For research and applied use, the six themes and profile-based metric provide a scalable method for auditing LLM value profiles before deployment in contexts where alignment with human priorities is critical.
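The abstract describes the profile-fidelity metric only as capturing both the relative ordering of priorities and the calibration of between-priority differences; it gives no formula. The sketch below is therefore not the authors' metric. It is a minimal illustration, under stated assumptions, of how the two components could be separated: ordering is approximated with a Spearman rank correlation, and calibration with the ratio of the model's score range to the human range. The function names, the scoring scale, and every number in the example profiles are hypothetical.

```python
from statistics import mean

# The six themes named in Study 1.
THEMES = ["Performance", "Adaptive Capacity", "Social Good",
          "Ethics and Responsibility", "Relational Integration", "Agency"]


def ranks(values):
    """Return average ranks (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def ordering_agreement(model, human):
    """Assumed ordering component: Spearman correlation of the two profiles."""
    return pearson(ranks(model), ranks(human))


def spread_ratio(model, human):
    """Assumed calibration component: model score range over human range.

    A value above 1 means the model exaggerates between-priority differences.
    """
    return (max(model) - min(model)) / (max(human) - min(human))


# Hypothetical mean scores per theme, in THEMES order (not data from the paper).
human_profile = [4.0, 3.8, 3.6, 3.9, 3.4, 3.0]
model_profile = [4.8, 4.2, 3.6, 4.5, 3.0, 2.0]

print("ordering agreement:", round(ordering_agreement(model_profile, human_profile), 3))
print("spread ratio:", round(spread_ratio(model_profile, human_profile), 3))
```

In this made-up example the model ranks the six themes in the same order as the human reference but spreads its scores over a wider range, which is the pattern the abstract reports for some models: matching order, exaggerated differences. A single fidelity score would need to combine the two components, and how the authors do so is not stated here.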

