Research agents enable models to gather information from the web using tools to answer user queries, which requires them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage, so RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration and uses them to shape exploration during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance over state-of-the-art baselines by up to 6.0% on the Qwen3-8B backbone and 5.8% on the Qwen3-4B backbone. Further analyses of tool-use patterns and training dynamics, compared against the baselines, shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.