TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($\rho = 0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
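
To make the correlation in the first finding concrete, the sketch below shows one way a coefficient such as $\rho$ could be computed across models. It assumes $\rho$ denotes Spearman's rank correlation, which the abstract does not specify. The function names (`ranks`, `pearson`, `spearman`) and the five score pairs are hypothetical placeholders, not values or code from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of entries tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder scores for five hypothetical models (NOT results from the paper).
guard_accuracy = [0.52, 0.61, 0.58, 0.74, 0.69]    # trajectory risk detection
structured_score = [0.40, 0.55, 0.60, 0.80, 0.65]  # structured-to-text benchmark

rho = spearman(guard_accuracy, structured_score)
print(f"rho = {rho:.2f}")  # prints "rho = 0.90" for these placeholder values
```

A rank-based coefficient is a natural fit when comparing scores from benchmarks with different scales, because it depends only on how the models are ordered on each axis. Whether the paper uses this or a different estimator is not stated in the abstract.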
