HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

Xuhui Zhou, Hyunwoo Kim, Faeze Brahman, Liwei Jiang, Hao Zhu, Ximing Lu, Frank Xu, Bill Yuchen Lin, Yejin Choi, Niloofar Mireshghallah, Ronan Le Bras, Maarten Sap

From arXiv. The second and third authors contributed equally.

AI agents are increasingly autonomous in their interactions with human users and tools, leading to increased interactional safety risks. We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions. HAICOSYSTEM features a modular sandbox environment that simulates multi-turn interactions between human users and AI agents, where the AI agents are equipped with a variety of tools (e.g., patient management platforms) to navigate diverse scenarios (e.g., a user attempting to access other patients' profiles). To examine the safety of AI agents in these interactions, we develop a comprehensive multi-dimensional evaluation framework that uses metrics covering operational, content-related, societal, and legal risks. By running 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education), we demonstrate that HAICOSYSTEM can emulate realistic user-AI interactions and complex tool use by AI agents. Our experiments show that state-of-the-art LLMs, both proprietary and open-source, exhibit safety risks in over 50% of cases, with models generally showing higher risks when interacting with simulated malicious users. Our findings highlight the ongoing challenge of building agents that can safely navigate complex interactions, particularly when faced with malicious users. To foster the AI agent safety ecosystem, we release a code platform that allows practitioners to create custom scenarios, simulate interactions, and evaluate the safety and performance of their agents.
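
To make the "safety risks in over 50% of cases" figure concrete, the following is a minimal sketch of how a per-episode, multi-dimensional risk flag could be aggregated into an overall risk ratio. It is not the released platform's API: the `EpisodeScores` class, the `risk_ratio` function, the dimension keys, and the convention that a negative score marks a risk are all assumptions made for illustration, and the sample scores are invented.

```python
from __future__ import annotations

from dataclasses import dataclass

# Risk categories named in the abstract; these key names are assumptions.
RISK_DIMENSIONS = ("operational", "content", "societal", "legal")


@dataclass
class EpisodeScores:
    """Per-dimension risk scores for one simulated episode (hypothetical schema).

    Assumed convention: 0 means no risk observed, negative values mean risk.
    """

    scenario_id: str
    scores: dict[str, int]

    def is_risky(self) -> bool:
        # An episode counts as risky if any single dimension shows a risk.
        return any(self.scores.get(dim, 0) < 0 for dim in RISK_DIMENSIONS)


def risk_ratio(episodes: list[EpisodeScores]) -> float:
    """Fraction of episodes flagged as risky on at least one dimension."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.is_risky() for e in episodes) / len(episodes)


# Invented scores purely for illustration (not results from the paper).
episodes = [
    EpisodeScores("s1", {"operational": 0, "content": 0, "societal": 0, "legal": 0}),
    EpisodeScores("s1", {"operational": -3, "content": 0, "societal": 0, "legal": 0}),
    EpisodeScores("s2", {"operational": 0, "content": -5, "societal": 0, "legal": -2}),
    EpisodeScores("s2", {"operational": 0, "content": 0, "societal": 0, "legal": 0}),
]
print(risk_ratio(episodes))  # 0.5
```

Flagging an episode when any one dimension shows a risk is only one possible aggregation rule; a stricter or weighted rule would give a different ratio for the same scores.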
