As AI systems advance in capability, measuring their safety and alignment with human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances in this field may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in questionnaire style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform the relevant behaviors and hence pose much greater risks. LLMs' engagement with scenarios described in questionnaire-style prompts differs starkly from that of agents built on the same LLMs, as reflected in divergences in inputs, possible actions, environmental interactions, and internal processing. LLMs' responses to scenario descriptions are therefore unlikely to be representative of the corresponding LLM agents' behavior. We further contend that such assessments rest on strong assumptions about LLMs' ability and tendency to report accurately on their counterfactual behavior. Lacking construct validity, these assessments are inadequate for evaluating risks from AI systems in real-world contexts. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss how safety assessments and alignment training can be improved by taking these shortcomings to heart.