Agentic reinforcement learning (RL) trains large language models to use tools, but its impact on alignment is poorly understood. We study how agentic RL for search affects the alignment of instruction-tuned (IT) models. We find that RL-trained models inherit refusal reasoning by deflecting harmful requests into benign search queries, but this breaks down under a simple diagnostic trigger that elicits a search call before refusal can occur. Under this condition, RL models produce multi-step unsafe search actions and reasoning, reducing search query safety by up to 68.6% in Qwen and Llama models relative to their IT counterparts. The effect generalises across model families, scales, and RL algorithms. To understand why, we identify linear directions in the residual stream that control search query safety, and show that RL training progressively shifts search behaviour toward the harmful end of this direction. We thus propose representation-guided RL training, which adds a reward penalty based on projection toward the harmful search direction. Trained on benign data alone, this method restores IT-level alignment without reducing task accuracy and requires no additional training data. Together, our work provides the first framework for diagnosing, mechanistically analysing, and mitigating alignment degradation in agentic RL for search.
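As a rough illustration of the reward shaping described in the abstract, the sketch below penalises a rollout's reward in proportion to how far a residual-stream activation projects onto a harmful search direction. The abstract does not specify how the direction is estimated or how the penalty is scaled, so everything here is an assumption: the difference-of-means estimator, the function names `harmful_direction` and `shaped_reward`, and the hypothetical parameters `baseline` and `lam`.

```python
import numpy as np


def harmful_direction(harmful_acts: np.ndarray, benign_acts: np.ndarray) -> np.ndarray:
    """Estimate a unit direction separating harmful from benign search activations.

    Assumption: a difference-of-means estimator over residual-stream activations
    of shape (n_samples, d_model). The abstract only says linear directions are
    identified, not how.
    """
    diff = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def shaped_reward(
    task_reward: float,
    activation: np.ndarray,
    direction: np.ndarray,
    baseline: float = 0.0,  # hypothetical: projection level treated as safe
    lam: float = 0.5,       # hypothetical: penalty weight
) -> float:
    """Subtract a penalty that grows with projection toward the harmful direction."""
    projection = float(activation @ direction)
    # Only projections beyond the baseline are penalised (one-sided hinge).
    penalty = lam * max(0.0, projection - baseline)
    return task_reward - penalty


# Toy 2-dimensional activations, for illustration only.
harmful = np.array([[2.0, 0.0], [4.0, 0.0]])
benign = np.array([[0.0, 0.0], [0.0, 0.0]])
direction = harmful_direction(harmful, benign)

penalised = shaped_reward(1.0, np.array([2.0, 5.0]), direction, baseline=0.5)
unpenalised = shaped_reward(1.0, np.array([-1.0, 3.0]), direction, baseline=0.5)
```

In this toy setting the first rollout projects past the baseline and loses part of its task reward, while the second projects away from the harmful direction and keeps its reward unchanged. A one-sided hinge is used so that benign rollouts are not rewarded merely for moving further from the direction; whether the proposed method makes the same choice is not stated in the abstract.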