Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities but introduce broad new sources of safety risk. Meanwhile, frontier AI models drastically lower the barrier to attack, leaving current agent alignment frameworks inadequate for real-world deployment. To address these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to cover emergent risks in Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification and use it to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) on only about 1k samples, achieving performance comparable to that of leading closed-source models (e.g., GPT-5.4). Building on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment that reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experiments show that AgentDoG 1.5 achieves state-of-the-art performance across diverse and complex interactive agentic scenarios. All models and datasets are openly released.