As AI systems advance in capability, measuring their safety and alignment with human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances in this field may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in questionnaire style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform the relevant behaviors and hence pose much greater risks. LLMs' engagement with scenarios described in questionnaire-style prompts differs starkly from that of agents built on the same LLMs, as reflected in divergences in inputs, possible actions, environmental interactions, and internal processing. LLMs' responses to scenario descriptions are therefore unlikely to be representative of the corresponding LLM agents' behavior. We further contend that such assessments rest on strong assumptions about LLMs' ability and tendency to report accurately on their counterfactual behavior. Lacking construct validity, these assessments are inadequate for evaluating risks from AI systems in real-world contexts. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss how safety assessments and alignment training can be improved by taking these shortcomings to heart.