Recent search-augmented LLMs trained with reinforcement learning (RL) can interleave search and reasoning to solve multi-hop tasks. However, as the accumulating context becomes flooded with both crucial evidence and irrelevant information, they face two critical failure modes: (1) ineffective search-chain construction, which produces incorrect queries or fails to retrieve critical information, and (2) reasoning hijacking, in which peripheral evidence causes the model to misidentify distractors as valid support. To address these challenges, we propose **D$^2$Plan**, a **D**ual-agent **D**ynamic global **Plan**ning paradigm for complex retrieval-augmented reasoning. **D$^2$Plan** operates through the collaboration of a *Reasoner* and a *Purifier*: the *Reasoner* constructs an explicit global plan during reasoning and dynamically adapts it based on retrieval feedback, while the *Purifier* assesses the relevance of retrieved documents and condenses the key information for the *Reasoner*. We further introduce a two-stage training framework, consisting of a supervised fine-tuning (SFT) cold start on synthesized trajectories followed by RL with plan-oriented rewards, to teach LLMs to master the **D$^2$Plan** paradigm. Extensive experiments demonstrate that **D$^2$Plan** enables more coherent multi-step reasoning and stronger resilience to irrelevant information, achieving superior performance on challenging QA benchmarks.
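To make the paradigm concrete, below is a minimal sketch of the Reasoner–Purifier interaction loop described above. It is an illustration under stated assumptions, not the paper's implementation: the `llm` and `retrieve` callables, the `SEARCH`/`REVISE`/`ANSWER` action format, and the prompt wording are all hypothetical stand-ins.

```python
# Minimal sketch of the D^2Plan dual-agent loop. `llm` (a text-in/text-out
# model call) and `retrieve` (a search function) are hypothetical stand-ins.

def d2plan(question: str, llm, retrieve, max_steps: int = 8) -> str:
    """Reasoner plans, queries, and revises; Purifier filters retrieved evidence."""
    # Reasoner: draft an explicit global plan up front.
    plan = llm(f"Draft a step-by-step global plan for answering: {question}")
    evidence: list[str] = []
    for _ in range(max_steps):
        # Reasoner: choose the next action given the current plan and evidence.
        step = llm(
            f"Question: {question}\nPlan: {plan}\nEvidence: {evidence}\n"
            "Output SEARCH:<query>, REVISE:<new plan>, or ANSWER:<answer>."
        )
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        if step.startswith("REVISE:"):
            # Dynamic global planning: adapt the plan based on retrieval feedback.
            plan = step.removeprefix("REVISE:").strip()
            continue
        query = step.removeprefix("SEARCH:").strip()
        docs = retrieve(query)
        # Purifier: judge relevance and condense; drop distractors so they
        # never enter the Reasoner's context.
        summary = llm(
            f"Question: {question}\nQuery: {query}\nDocs: {docs}\n"
            "Keep only passages relevant to the query and condense them, "
            "or reply IRRELEVANT if nothing is useful."
        )
        if summary.strip() != "IRRELEVANT":
            evidence.append(summary)
    # Fallback: answer from whatever purified evidence was gathered.
    return llm(f"Question: {question}\nEvidence: {evidence}\nGive the final answer.")
```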