While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where a task is broken down incorrectly; retrieval misses, where key evidence is not retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure at any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that explicitly identifies faulty steps, erases them, and regenerates the reasoning in place, so that defective logic does not propagate through the chain. This targeted correction makes otherwise brittle reasoning more resilient. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle: the 3B model gains +8.48% exact match (EM) and +11.56% F1, and the 7B model gains +5.38% EM and +7.22% F1, over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning offers a powerful paradigm for robust multi-step reasoning in LLMs.
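The identify, erase, and regenerate cycle described in the abstract can be illustrated with a small toy loop. This is only a sketch of one plausible reading of "regenerate in place": each newly proposed step is checked before it is committed, and a faulty step is discarded and re-proposed at the same position so that later steps never build on it. The abstract does not say how faulty steps are detected or how the reinforcement learning objective is defined, so `run_with_erasure`, `propose`, `is_faulty`, `is_done`, and the scripted toy data are all hypothetical names and assumptions, not the authors' implementation.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not taken from the paper):
#   propose(chain, attempt) -> next step, given the committed chain and retry index
#   is_faulty(chain, step)  -> True if the candidate step should be erased
#   is_done(chain)          -> True once the chain is complete
Propose = Callable[[List[str], int], str]
IsFaulty = Callable[[List[str], str], bool]
IsDone = Callable[[List[str]], bool]


def run_with_erasure(
    propose: Propose,
    is_faulty: IsFaulty,
    is_done: IsDone,
    max_steps: int = 8,
    max_retries: int = 2,
) -> List[str]:
    """Build a reasoning chain, erasing and regenerating faulty steps in place."""
    chain: List[str] = []
    while len(chain) < max_steps and not is_done(chain):
        attempt = 0
        step = propose(chain, attempt)
        # Erase the faulty candidate and regenerate it at the same position.
        while is_faulty(chain, step) and attempt < max_retries:
            attempt += 1
            step = propose(chain, attempt)
        # Commit the step (the last candidate is kept if retries run out).
        chain.append(step)
    return chain


# Toy scripted "model": the first retrieval attempt misses the evidence.
SCRIPT = {
    0: ["decompose: find the director, then the director's birthplace"],
    1: ["retrieve: <no documents found>", "retrieve: doc about the director"],
    2: ["reason: the birthplace is stated in the doc"],
}


def toy_propose(chain: List[str], attempt: int) -> str:
    options = SCRIPT[len(chain)]
    return options[min(attempt, len(options) - 1)]


def toy_is_faulty(chain: List[str], step: str) -> bool:
    return "<no documents found>" in step


def toy_is_done(chain: List[str]) -> bool:
    return len(chain) == len(SCRIPT)


chain = run_with_erasure(toy_propose, toy_is_faulty, toy_is_done)
for step in chain:
    print(step)
```

In this toy run the failed retrieval is erased and replaced before the reasoning step is generated, so the final chain contains only the successful retrieval. A trained system would replace the scripted proposals and the string-matching check with model-driven components.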