Reinforcement learning (RL) with outcome-based rewards has achieved significant success in training large language model (LLM) agents for complex reasoning tasks. However, in active reasoning, where agents must strategically ask questions to acquire task-relevant information, we find that LLM agents trained with RL often suffer from information self-locking: the agent ceases to ask informative questions and struggles to internalize already-obtained information. To understand this phenomenon, we decompose active reasoning into two core capabilities: Action Selection (AS), which determines the observation stream through queries, and Belief Tracking (BT), which updates the agent's belief based on collected evidence. We show that deficient AS and BT capabilities limit information exploration during RL training. Insufficient exploration in turn hinders the improvement of AS and BT, creating a feedback loop that locks the agent in a low-information regime. To break this loop, we propose a simple yet effective approach that reallocates the learning signal by injecting easy-to-obtain directional critiques, helping the agent escape self-locking. Extensive experiments on 7 datasets show that our approach significantly mitigates information self-locking, yielding improvements of up to 60%.
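The abstract does not specify the exact form of the critique signal or how it is combined with the outcome reward. As a rough illustration only, the Python sketch below shows one plausible way a dense, per-query directional critique could be blended with a sparse episode-level outcome reward; all names (`Turn`, `shaped_return`, `critique_weight`) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    question: str     # query issued by the agent (Action Selection)
    observation: str  # environment's answer (evidence for Belief Tracking)
    critique: float   # directional critique in [-1, 1]: was the query informative?


def shaped_return(turns: List[Turn], outcome_reward: float,
                  critique_weight: float = 0.3) -> float:
    """Reallocate the learning signal: blend the sparse outcome reward with
    dense, easy-to-obtain directional critiques on each query.

    Hypothetical illustration; the paper's actual algorithm may differ."""
    if not turns:
        return outcome_reward
    dense = sum(t.critique for t in turns) / len(turns)
    return (1.0 - critique_weight) * outcome_reward + critique_weight * dense


# Usage: an episode where the agent asked two informative questions but still
# failed the task (outcome_reward = 0). The critiques keep the return positive,
# so informative questioning is still reinforced instead of being extinguished,
# which is the low-information feedback loop the abstract describes.
episode = [
    Turn("Is the target an animal?", "Yes.", critique=0.8),
    Turn("Does it live in water?", "No.", critique=0.6),
]
print(shaped_return(episode, outcome_reward=0.0))  # > 0: exploration rewarded
```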