LLM-based autonomous agents possess capabilities such as reasoning, tool invocation, and environment interaction, enabling them to execute complex multi-step tasks. The internal reasoning process, i.e., the thought, within a behavioral trajectory significantly influences tool usage and subsequent actions, but it can also introduce risks: even minor deviations in an agent's thought may trigger cascading effects that lead to irreversible safety incidents. To address the safety alignment challenges in long-horizon behavioral trajectories, we propose Thought-Aligner, a plug-in dynamic thought correction module. Using a lightweight, resource-efficient model, Thought-Aligner corrects each high-risk thought on the fly before the corresponding action is executed. The corrected thought is then reintroduced to the agent, ensuring safer subsequent decisions and tool interactions. Importantly, Thought-Aligner modifies only the reasoning phase without altering the underlying agent framework, making it easy to deploy and widely applicable across agent frameworks. To train the Thought-Aligner model, we construct an instruction dataset spanning ten representative scenarios and simulate ReAct execution trajectories, yielding 5,000 diverse instructions and more than 11,400 pairs of safe and unsafe thoughts. The model is then fine-tuned with contrastive learning. Experiments on three agent safety benchmarks covering 12 different LLMs show that Thought-Aligner raises agent behavioral safety from approximately 50% in the unprotected setting to 90% on average. Moreover, Thought-Aligner keeps response latency below 100 ms with minimal resource usage, demonstrating efficient deployment, broad applicability, and timely responsiveness. The method thus provides a practical dynamic safety solution for LLM-based agents.
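To make the interception point concrete, the following is a minimal sketch of how such a plug-in correction hook could sit inside a ReAct-style loop. The names `agent_llm`, `aligner`, and `execute_tool` are hypothetical stand-ins, not APIs defined in this work; the point is only that the aligner rewrites each thought before the corresponding action is generated and executed, leaving the agent framework itself untouched.

```python
def react_step_with_aligner(agent_llm, aligner, history):
    """One ReAct step with on-the-fly thought correction (illustrative only)."""
    # 1. The agent produces its next thought from the trajectory so far.
    thought = agent_llm.generate_thought(history)

    # 2. The aligner rewrites the thought if it judges it high-risk;
    #    safe thoughts would pass through unchanged.
    corrected = aligner.correct(thought, context=history)

    # 3. The (possibly corrected) thought is fed back to the agent,
    #    which derives its action and tool arguments from it.
    action = agent_llm.generate_action(history, corrected)
    observation = execute_tool(action)

    # 4. The corrected thought, action, and observation extend the
    #    trajectory for the next step; only the reasoning phase changed.
    history.extend([corrected, action, observation])
    return history
```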
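The abstract does not spell out the contrastive objective used on the safe/unsafe thought pairs; one plausible pairwise formulation, sketched below under the assumption of a Hugging Face-style causal LM whose forward pass returns `.logits`, pushes the model to assign higher likelihood to the safe thought than to the unsafe one given the same context (padding handling is omitted for brevity).

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, labels):
    """Summed log-probability of `labels` under `model` via teacher forcing."""
    logits = model(input_ids).logits[:, :-1]          # predict next token
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, labels[:, 1:].unsqueeze(-1)).squeeze(-1).sum(-1)

def pairwise_contrastive_loss(logp_safe, logp_unsafe, beta=0.1):
    """DPO-style margin loss over a (safe, unsafe) thought pair.

    `logp_safe` / `logp_unsafe` are sequence log-probs of the two thoughts
    conditioned on the same instruction context; `beta` is a hypothetical
    temperature, not a value reported in the paper.
    """
    return -F.logsigmoid(beta * (logp_safe - logp_unsafe)).mean()
```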