AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, Binxin Hu, Ling Tang, Jilin Mei, Dadi Guo, Leitao Yuan, Junyao Yang, Guanxu Chen, Qihao Lin, Yi Yu, Bo Zhang, Jiaxuan Guo, Jie Zhang, Wenqi Shao, Huiqi Deng, Zhiheng Xi, Wenjie Wang, Wenxuan Wang, Wen Shen, Zhikai Chen, Haoyu Xie, Jialing Tao, Juntao Dai, Jiaming Ji, Zhongjie Ba, Linfeng Zhang, Yong Liu, Quanshi Zhang, Lei Zhu, Zhihua Wei, Hui Xue, Chaochao Lu, Jing Shao, Xia Hu

From arXiv; 40 pages, 26 figures

The rise of AI agents introduces complex safety and security challenges that arise from autonomous tool use and environmental interactions. Current guardrail models lack awareness of agentic risks and offer little transparency in risk diagnosis. To build an agentic guardrail that covers the many complex risky behaviors agents can exhibit, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured, hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained, contextual monitoring across agent trajectories. More importantly, AgentDoG can diagnose the root causes of both unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across the Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation across diverse and complex interactive scenarios. All models and datasets are openly released.
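
As a rough illustration of what a diagnosis along three orthogonal axes could look like as a data structure, the sketch below pairs a binary verdict with a (source, failure mode, consequence) label for one trajectory step. It is not AgentDoG's actual schema or output format: the class names, the field names, and every category value are hypothetical placeholders chosen for the example, since the abstract names only the three axes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


# NOTE: all category values below are hypothetical placeholders,
# not the categories defined by the paper's taxonomy.
class Source(Enum):           # where the risk originates
    USER_INPUT = "user_input"
    TOOL_OUTPUT = "tool_output"


class FailureMode(Enum):      # how the agent goes wrong
    FOLLOWS_INJECTED_INSTRUCTION = "follows_injected_instruction"
    UNJUSTIFIED_ACTION = "unjustified_action"


class Consequence(Enum):      # what harm results
    DATA_LEAK = "data_leak"
    FINANCIAL_LOSS = "financial_loss"


@dataclass(frozen=True)
class Diagnosis:
    """A binary verdict plus an optional three-axis label explaining it."""
    step: int                                   # index of the judged trajectory step
    safe: bool
    source: Optional[Source] = None
    failure_mode: Optional[FailureMode] = None
    consequence: Optional[Consequence] = None

    def explain(self) -> str:
        # Safe steps carry no label; unsafe steps report all three axes.
        if self.safe:
            return f"step {self.step}: safe"
        return (
            f"step {self.step}: unsafe"
            f" | where={self.source.value}"
            f" | how={self.failure_mode.value}"
            f" | what={self.consequence.value}"
        )


diagnosis = Diagnosis(
    step=3,
    safe=False,
    source=Source.TOOL_OUTPUT,
    failure_mode=FailureMode.FOLLOWS_INJECTED_INSTRUCTION,
    consequence=Consequence.DATA_LEAK,
)
print(diagnosis.explain())
```

Keeping the three axes as separate fields, rather than as one flat list of risk types, mirrors the orthogonality described in the abstract: any source can in principle combine with any failure mode and any consequence.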
