Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks across diverse domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations focus primarily on simple, everyday assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behavior in professional settings. To address this gap, we introduce \textbf{SafePro}, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity, safety-critical tasks spanning diverse professional domains, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers previously unreported unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust safety mechanisms tailored to the next generation of professional AI agents.