State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness. Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness. These methods work well on straightforward reasoning tasks but often falter on challenging tasks such as competitive programming and mathematics, due to frequent reasoning errors and irrelevant knowledge retrieval. To address this, we introduce Critic-guided planning with Retrieval-augmentation (CR-Planner), a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning. CR-Planner solves a problem by iteratively selecting and executing sub-goals. At each step, it identifies the most promising sub-goal among reasoning, query generation, and retrieval, guided by rewards from a critic model called the sub-goal critic. It then executes this sub-goal by sampling candidate outputs and selecting the optimal one based on evaluations from another critic model, the execution critic. This iterative process, informed by retrieved information and critic models, enables CR-Planner to effectively navigate the solution space towards the final answer. We employ Monte Carlo Tree Search to collect the data for training the critic models, allowing for a systematic exploration of action sequences and their long-term impacts. We validate CR-Planner on challenging domain-knowledge-intensive and reasoning-heavy tasks, including competitive programming, theorem-driven math reasoning, and complex domain retrieval problems. Our experiments demonstrate that CR-Planner significantly outperforms baselines, highlighting its effectiveness in addressing challenging problems by improving both reasoning and retrieval.
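The select-then-execute loop described above can be sketched in a few lines of Python. This is a minimal toy illustration, not the authors' implementation: the critic functions below are hypothetical hand-written stand-ins for the fine-tuned critic models, and `sample_executions` stands in for sampling from an LLM. Only the control flow — pick the sub-goal with the highest sub-goal-critic reward, then sample candidates and keep the one the execution critic scores highest — mirrors the framework as described.

```python
# Sub-goal types named in the abstract: reasoning, query generation, retrieval.
SUB_GOALS = ["reason", "gen_query", "retrieve"]


def sub_goal_critic(state, sub_goal):
    """Toy reward model (hypothetical): generate a query before retrieving,
    retrieve before further reasoning. CR-Planner fine-tunes an LLM for this."""
    if sub_goal == "gen_query":
        return 1.0 if "query" not in state else 0.0
    if sub_goal == "retrieve":
        return 1.0 if "query" in state and "docs" not in state else 0.0
    return 0.5  # default reward for a reasoning step


def sample_executions(state, sub_goal, k=3):
    """Stand-in for sampling k candidate executions from an LLM."""
    return [f"{sub_goal}-candidate-{i}" for i in range(k)]


def execution_critic(state, candidate):
    """Toy scorer (hypothetical): deterministically prefers the last sample.
    CR-Planner instead fine-tunes a second critic model for this evaluation."""
    return int(candidate.rsplit("-", 1)[-1])


def cr_planner(question, steps=3):
    """Iteratively select and execute sub-goals, guided by the two critics."""
    state = {"question": question, "steps": []}
    trajectory = []
    for _ in range(steps):
        # 1. Sub-goal selection: highest reward from the sub-goal critic.
        sub_goal = max(SUB_GOALS, key=lambda g: sub_goal_critic(state, g))
        # 2. Execution: sample candidates, keep the execution critic's best.
        candidates = sample_executions(state, sub_goal)
        best = max(candidates, key=lambda c: execution_critic(state, c))
        # 3. Update state with the selected output.
        if sub_goal == "gen_query":
            state["query"] = best
        elif sub_goal == "retrieve":
            state["docs"] = best
        else:
            state["steps"].append(best)
        trajectory.append(sub_goal)
    return trajectory, state
```

With these toy critics the planner first generates a query, then retrieves, then reasons — e.g. `cr_planner("q")` returns the trajectory `["gen_query", "retrieve", "reason"]`. In the actual framework both critics are trained on data collected via Monte Carlo Tree Search, so the rewards reflect long-term impacts of action sequences rather than hand-coded rules.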