AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

From arXiv: 44 pages, 12 figures, 9 tables

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities but also introduce a broad range of new sources of safety risk. Meanwhile, frontier AI models drastically lower the barrier to attack, rendering current agent alignment frameworks inadequate for real-world deployment. To address these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) on only about 1k samples, achieving performance comparable to leading closed-source models (e.g., GPT-5.4). Building on AgentDoG 1.5, we construct a highly efficient training environment for agentic safety SFT and RL, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experiments indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.
