LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on diverse agent-safety benchmarks and six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment. We publicly release Thought-Aligner-7B at https://huggingface.co/WhitzardAgent/Thought-Aligner-7B.
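The thought-level correction loop described above can be illustrated with a minimal sketch. It is not the released implementation: the names `run_guarded_step`, `propose_thought`, `align_thought`, and `choose_action` are hypothetical, and the toy functions are stand-ins for the agent LLM and for Thought-Aligner. The sketch shows only the control flow, in which a thought is corrected before any action is chosen and the corrected thought, not the raw one, is stored in the trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    thought: str  # the thought that was actually used (after correction)
    action: str   # the action chosen from that thought


@dataclass
class Trajectory:
    instruction: str
    steps: List[Step] = field(default_factory=list)


def run_guarded_step(
    traj: Trajectory,
    propose_thought: Callable[[Trajectory], str],
    align_thought: Callable[[str, List[Step], str], str],
    choose_action: Callable[[Trajectory, str], str],
) -> Step:
    """Run one agent step with thought-level correction (hypothetical interface)."""
    raw_thought = propose_thought(traj)
    # Correct the thought BEFORE an action is selected or executed.
    safe_thought = align_thought(traj.instruction, traj.steps, raw_thought)
    # The agent conditions on the corrected thought, not the raw one.
    action = choose_action(traj, safe_thought)
    step = Step(thought=safe_thought, action=action)
    traj.steps.append(step)
    return step


# Toy stand-ins; a real system would call the agent LLM and the safety model here.
def toy_propose(traj: Trajectory) -> str:
    return "Delete every file in /tmp/project to free space."


def toy_align(instruction: str, history: List[Step], thought: str) -> str:
    # Assumption: a keyword rule stands in for the learned correction model.
    if "delete every file" in thought.lower():
        return "List the files first and ask the user to confirm before deleting anything."
    return thought


def toy_choose(traj: Trajectory, thought: str) -> str:
    return "ask_user" if "confirm" in thought else "run_shell"


traj = Trajectory(instruction="Free up disk space")
step = run_guarded_step(traj, toy_propose, toy_align, toy_choose)
```

Because the corrector is passed in as a plain callable, the agent itself is left unmodified, which is the property the abstract calls model-agnostic.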