LLM-based agents are becoming increasingly capable, yet their safety lags behind, creating a gap between what agents can do and what they should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, which introduces new risks that existing benchmarks overlook. To systematically scale safety testing to multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark for evaluating the safety of multi-turn, tool-using agents. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.
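As a reading aid for the ASR figures above, the following is a minimal sketch of how an attack success rate and its change between two settings could be computed. It assumes ASR is the fraction of attack attempts judged successful; the function names and the toy outcome lists are hypothetical and are not taken from the paper or its code. The abstract does not say whether the 16% and 30% figures are absolute percentage-point changes or relative changes, so the sketch computes both.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of attack attempts judged successful."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def asr_change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute_points = (after - before) * 100
    relative_percent = (after - before) / before * 100 if before else float("inf")
    return absolute_points, relative_percent


# Hypothetical toy outcomes, for illustration only (not results from the paper).
single_turn = [True, False, False, False]  # 1 of 4 attacks succeed
multi_turn = [True, True, False, False]    # 2 of 4 attacks succeed

asr_single = attack_success_rate(single_turn)
asr_multi = attack_success_rate(multi_turn)
abs_points, rel_percent = asr_change(asr_single, asr_multi)

print(f"single-turn ASR: {asr_single:.2f}")
print(f"multi-turn ASR:  {asr_multi:.2f}")
print(f"change: {abs_points:+.1f} points, {rel_percent:+.1f}% relative")
```

On the toy data the two readings differ widely (a 25-point absolute rise is a 100% relative rise), which is why it matters which convention a reported ASR change uses.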