This paper introduces MR-Search, an in-context meta reinforcement learning (RL) formulation for agentic search with self-reflection. Instead of optimizing a policy within a single independent episode with sparse rewards, MR-Search trains a policy that conditions on past episodes and adapts its search strategy across them. MR-Search learns to learn a search strategy with self-reflection, allowing search agents to improve in-context exploration at test time. Specifically, MR-Search performs cross-episode exploration by generating an explicit self-reflection after each episode and using it as additional context to guide subsequent attempts, thereby promoting more effective exploration at test time. We further introduce a multi-turn RL algorithm that estimates a dense relative advantage at the turn level, enabling fine-grained credit assignment within each episode. Empirical results demonstrate the advantages of MR-Search over RL-based baselines, showing strong generalization and relative improvements of 9.2% to 19.3% across eight benchmarks. Our code and data are available at https://github.com/tengxiao1/MR-Search.
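As a minimal sketch of the turn-level dense relative advantage idea mentioned above: given a group of sampled episodes, each turn's reward can be normalized against the group statistics at the same turn index (a GRPO-style group baseline applied per turn). The function below is a hypothetical illustration of this mechanism, not the paper's exact algorithm; names and details are assumptions.

```python
import statistics

def turn_level_relative_advantage(turn_rewards_per_episode):
    """Hypothetical sketch: estimate a dense, turn-level relative advantage
    by normalizing each turn's reward against the mean and std of rewards
    at the same turn index across a group of sampled episodes.
    (Illustrative only; the actual MR-Search estimator may differ.)"""
    advantages = []
    for episode in turn_rewards_per_episode:
        adv = []
        for t, reward in enumerate(episode):
            # Group baseline: rewards at turn t across all episodes long enough.
            group = [other[t] for other in turn_rewards_per_episode
                     if t < len(other)]
            mean = statistics.fmean(group)
            std = statistics.pstdev(group)
            adv.append((reward - mean) / (std + 1e-8))
        advantages.append(adv)
    return advantages
```

Normalizing per turn rather than once per episode turns a sparse outcome signal into a dense one, so credit can be assigned to individual turns within each episode.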