Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks across diverse domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations focus primarily on simple, everyday assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behavior in professional settings. To address this gap, we introduce \textbf{SafePro}, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity, safety-critical tasks spanning diverse professional domains, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers previously unreported unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust safety mechanisms tailored to the next generation of professional AI agents.