While LLM-based agents can interact with environments by invoking external tools, these expanded capabilities also amplify security risks. Monitoring step-level tool invocation behavior in real time and proactively intervening before unsafe execution are critical for agent deployment, yet remain under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, via multi-task reinforcement learning. The model proactively detects unsafe tool invocations before execution by reasoning over the interaction history: it assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Finally, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents that reduces the harmful tool invocations of ReAct-style agents by 65% on average and improves benign task completion by roughly 10% under prompt injection attacks.
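To make the guardrail-in-the-loop idea concrete, the sketch below shows one plausible shape of a step-level check gating a ReAct step: the guardrail inspects the pending tool call together with the interaction history before execution, and unsafe actions are blocked with the guardrail's feedback returned to the agent. The `agent`, `guard`, and tool interfaces here are hypothetical placeholders for illustration, not TS-Guard's or TS-Flow's actual APIs.

```python
# Minimal sketch of a step-level tool-invocation guardrail in a ReAct loop.
# NOTE: `agent.propose_action`, `guard.assess`, and the tool registry are
# assumed interfaces, not the paper's actual implementation.
from dataclasses import dataclass


@dataclass
class SafetyJudgment:
    safe: bool      # whether the pending tool call may execute
    feedback: str   # interpretable rationale: request harmfulness and
                    # action-attack correlation, returned to the agent


def guarded_react_step(agent, guard, tools, history):
    """Run one ReAct step, gated by a pre-execution safety check.

    The guardrail reasons over the full interaction history plus the
    pending action; unsafe invocations are never executed, and the
    guardrail's feedback is injected as an observation so the agent
    can re-plan on its next step.
    """
    # Agent proposes the next tool call, e.g. {"tool": "send_email", "args": {...}}
    action = agent.propose_action(history)

    # Step-level safety check *before* execution.
    judgment: SafetyJudgment = guard.assess(history, action)
    if not judgment.safe:
        # Proactive intervention: block execution, feed the rationale back.
        history.append({"role": "guardrail", "content": judgment.feedback})
        return history

    # Action deemed safe: execute the tool and record the observation.
    observation = tools[action["tool"]](**action["args"])
    history.append({"role": "tool", "content": observation})
    return history
```

Under this framing, the feedback string is what distinguishes a guardrail-feedback-driven loop from a plain allow/deny filter: a blocked agent receives an explanation it can condition on, rather than silently losing a step.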