Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its specific contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the exact-match (EM) reward due to training collapse driven by answer avoidance; incorporating action-level penalties mitigates this collapse and ultimately lets the F1-based reward surpass EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO is the least stable of the three policy optimization methods. Building on these insights, we introduce Search-R1++, a strong baseline that improves the performance of Search-R1 from 0.403 to 0.442 (Qwen2.5-7B) and from 0.289 to 0.331 (Qwen2.5-3B). We hope these findings pave the way for more principled and reliable RL training strategies in Deep Research systems.
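To make the reward comparison in point 2) concrete, the sketch below contrasts an EM reward, a token-level F1 reward, and an F1 reward augmented with an action-level penalty for trajectories that never emit a final answer. This is a minimal illustration of the general idea, not the paper's implementation: the function names, the `answered` flag, and the `penalty` value are assumptions introduced here for clarity.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and tokenize on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def em_reward(pred: str, gold: str) -> float:
    """Exact-match (EM) reward: 1.0 iff the normalized prediction equals the gold answer."""
    return float(normalize(pred) == normalize(gold))


def f1_reward(pred: str, gold: str) -> float:
    """Token-level F1 between the predicted and gold answers."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())  # multiset token overlap
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def penalized_f1_reward(pred: str, gold: str, answered: bool, penalty: float = 0.5) -> float:
    """F1 reward with a hypothetical action-level penalty: a trajectory that
    avoids committing to a final answer receives -penalty instead of a partial
    F1 score, removing the incentive for answer avoidance.
    (`answered` and `penalty` are illustrative assumptions, not from the paper.)"""
    if not answered:
        return -penalty
    return f1_reward(pred, gold)
```

Under a plain F1 reward, vague or evasive outputs can still collect partial token overlap, which is one plausible mechanism for the answer-avoidance collapse described above; the action-level penalty makes refusing to answer strictly worse than committing to any answer.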