Agentic language models operate in a fundamentally different safety regime from chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings because of sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act-or-refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions that scalar rewards often miss. We evaluate MOSAIC zero-shot across three model families (Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4) and across out-of-distribution benchmarks spanning harmful tasks, prompt injection, benign tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, reduces privacy leakage, and preserves or improves benign task performance, demonstrating robust generalization across models, domains, and agentic settings.
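As a rough illustration of the two ideas above, the following sketch shows a plan, check, then act-or-refuse loop in which refusal is an ordinary recorded action, together with a pairwise loss over two trajectory scores. It is a minimal sketch under stated assumptions, not MOSAIC's implementation: the names `Step`, `run_agent`, `propose`, `is_safe`, `call_tool`, and `pairwise_preference_loss` are hypothetical, and the Bradley-Terry form of the loss is a common choice for pairwise preferences that the abstract does not confirm.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    plan: str     # the action the agent proposed at this step
    verdict: str  # "act" or "refuse"
    result: str   # tool output, or a refusal message


def run_agent(
    task: str,
    propose: Callable[[str, List[Step]], Optional[str]],
    is_safe: Callable[[str, List[Step]], bool],
    call_tool: Callable[[str], str],
    max_steps: int = 8,
) -> List[Step]:
    """Hypothetical plan -> check -> act-or-refuse loop.

    All three callables are stand-ins (assumptions): in a real agent
    they would be backed by the language model and its tools.
    """
    history: List[Step] = []
    for _ in range(max_steps):
        plan = propose(task, history)      # plan the next action
        if plan is None:                   # nothing left to do
            break
        if not is_safe(plan, history):     # explicit safety check
            # Refusal is logged as a first-class action, then the run stops.
            history.append(Step(plan, "refuse", "refused: unsafe action"))
            break
        history.append(Step(plan, "act", call_tool(plan)))
    return history


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss, -log(sigmoid(margin)), for one trajectory pair.

    Assumption: each trajectory has already been reduced to a scalar score.
    The loss shrinks as the preferred trajectory outscores the rejected one.
    """
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))
```

In this sketch a trajectory that refuses at the unsafe step and one that executes it would form a comparison pair, so only their relative ordering is needed for training rather than an absolute reward for either one.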