LLM-based agents are becoming increasingly capable, yet their safety lags behind, creating a gap between what agents can do and what they should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks that existing benchmarks overlook. To systematically scale safety testing to multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark for evaluating the safety of multi-turn, tool-using agents. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for use at deployment. Experiments show that ToolShield reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.
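The self-exploration defense described above can be sketched as a simple loop. This is an illustrative outline only, not the authors' implementation: the `Tool` class, `generate_test_cases` stub, and `explore_tool` function are all hypothetical stand-ins, and in the actual system the test-case generation and safety-experience distillation would be performed by the LLM itself rather than the placeholders shown here.

```python
# Minimal sketch of a ToolShield-style self-exploration loop (hypothetical API).
from dataclasses import dataclass


@dataclass
class Tool:
    """Stand-in for a tool the agent encounters for the first time."""
    name: str
    description: str

    def execute(self, args: dict) -> str:
        # Sandboxed execution stub; a real agent would invoke the tool here
        # and observe its actual downstream effects.
        return f"executed {self.name} with {args}"


def generate_test_cases(tool: Tool, n: int = 3) -> list[dict]:
    # Stand-in for an LLM call that proposes probing inputs,
    # including risky edge-case argument combinations.
    return [{"probe_id": i, "input": f"edge-case-{i}"} for i in range(n)]


def explore_tool(tool: Tool) -> list[str]:
    """Generate test cases, execute them, and distill safety experiences."""
    experiences = []
    for case in generate_test_cases(tool):
        outcome = tool.execute(case)
        # In the real system, an LLM would summarize the observed effect
        # into a reusable safety note; we record the raw observation.
        experiences.append(f"{tool.name}: probe {case['probe_id']} -> {outcome}")
    return experiences


tool = Tool("send_email", "Sends an email on the user's behalf")
safety_memory = explore_tool(tool)  # distilled experiences, reused at deployment
```

Because the exploration happens once per new tool and the distilled experiences are stored for later use, this design requires no retraining and makes no assumptions about any particular tool's interface.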