Typical search agents concatenate the entire interaction history into the LLM context, which preserves information integrity but produces long, noisy contexts and incurs high computation and memory costs. In contrast, using only the current turn avoids this overhead but discards essential information. This trade-off limits the scalability of search agents. To address this challenge, we propose MemSearcher, an agent workflow that iteratively maintains a compact memory and combines it with the current turn. At each turn, MemSearcher fuses the user's question with the memory to generate reasoning traces, perform search actions, and update the memory to retain only information essential for solving the task. This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy. To optimize this workflow, we introduce multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategies, and memory management of MemSearcher agents. Specifically, multi-context GRPO samples groups of trajectories under different contexts and propagates trajectory-level advantages across all conversations within them. Trained on the same dataset as Search-R1, MemSearcher achieves significant improvements over strong baselines on seven public benchmarks, with relative average gains of +11% on Qwen2.5-3B-Instruct and +12% on Qwen2.5-7B-Instruct. Notably, the 3B-based MemSearcher even outperforms 7B-based baselines, demonstrating that striking a balance between information integrity and efficiency yields both higher accuracy and lower computational overhead. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher.
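To make the per-turn workflow concrete, below is a minimal Python sketch of the loop described above: each turn's context consists only of the question and a compact memory, and the agent either searches or answers while rewriting the memory. The helper names (`llm_generate`, `search_engine`, `extract_tag`) and the `<search>`/`<answer>`/`<memory>` tag format are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of the MemSearcher per-turn workflow (assumed helpers,
# not the authors' code): the context is always question + compact memory,
# so its length stays roughly constant across turns.

def extract_tag(text, tag):
    """Return the content of the first <tag>...</tag> block, if present."""
    start, end = f"<{tag}>", f"</{tag}>"
    if start in text and end in text:
        return text.split(start, 1)[1].split(end, 1)[0].strip()
    return None


def memsearcher_episode(question, llm_generate, search_engine, max_turns=8):
    memory = ""  # compact memory carried across turns instead of full history
    for _ in range(max_turns):
        prompt = (
            f"Question: {question}\n"
            f"Memory: {memory}\n"
            "Reason, then either issue <search>query</search> or give "
            "<answer>...</answer>, and emit an updated <memory> block."
        )
        output = llm_generate(prompt)

        # The model rewrites the memory each turn, keeping only what is still
        # needed to solve the task rather than appending raw documents.
        memory = extract_tag(output, "memory") or memory

        answer = extract_tag(output, "answer")
        if answer is not None:
            return answer

        query = extract_tag(output, "search")
        if query is not None:
            docs = search_engine(query)
            # Retrieved evidence is exposed to the next turn via the memory,
            # where the model can condense or discard it.
            memory = f"{memory}\nRetrieved: {docs}"
    return None
```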
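The following sketch illustrates the trajectory-level advantage propagation in multi-context GRPO as summarized above: rewards within a sampled group are normalized, and each trajectory's single advantage is shared by all conversations (per-turn contexts) it produced. The function names and the group-normalization details are assumptions for illustration only.

```python
# Illustrative sketch of multi-context GRPO advantage assignment (assumed
# normalization scheme, not the authors' implementation).
import statistics


def group_advantages(rewards):
    """Normalize trajectory rewards within a sampled group (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]


# Each trajectory i spans several conversations (one per turn, each with its
# own context); all of them are trained with the same trajectory-level
# advantage advantages[i].
rewards = [1.0, 0.0, 1.0, 0.0]          # e.g. answer correctness per trajectory
advantages = group_advantages(rewards)   # broadcast to every conversation in i
```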