Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them

Agentic search requires large language models (LLMs) to perform multi-step search to solve complex information-seeking tasks, imposing unique challenges on their reasoning capabilities. However, what constitutes effective reasoning for agentic search, and how it can be learned, remain unclear. In this work, we first investigate the reasoning behaviors that enable success in agentic search. By comparing successful and failed trajectories via an LLM-based analysis pipeline, we identify four beneficial behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Building on this, we propose Behavior Priming, a training approach that equips agentic search models with these reasoning behaviors before reinforcement learning (RL). Specifically, it first performs supervised fine-tuning (SFT) on collected trajectories that exhibit the identified behaviors, and then applies standard RL to further improve task performance. Experiments on Qwen3-1.7B and Llama3.2-3B-Instruct show that Behavior Priming yields relative improvements over direct RL of 37.2% on three web benchmarks and 6.2% on seven multi-hop QA benchmarks, and outperforms the SFT-then-RL baseline that uses outcome-correct trajectories for fine-tuning. Crucially, we show that these reasoning behaviors matter more than outcome correctness in the priming stage prior to RL. Further analysis reveals that Behavior Priming enhances exploration (pass@8) and test-time scaling (number of search steps), providing a robust foundation for RL. Our code is available at https://github.com/cxcscmu/Behavior-Priming-for-Agentic-Search.
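To make the contrast between behavior-based and outcome-based priming data concrete, the following is a minimal sketch of the two selection rules. It is not the released implementation: the `Trajectory` record, its field names, the behavior label strings, and the rule that a trajectory must exhibit all four behaviors are assumptions introduced only for illustration, and the behavior labels are assumed to come from an upstream LLM-based annotator that is not shown.

```python
from dataclasses import dataclass, field

# The four behaviors named in the abstract; the label strings are hypothetical.
BEHAVIORS = frozenset({
    "information_verification",
    "authority_evaluation",
    "adaptive_search",
    "error_recovery",
})


@dataclass
class Trajectory:
    """Hypothetical record for one agentic search rollout."""
    question: str
    steps: list[str]            # reasoning / search steps as plain text
    correct: bool               # whether the final answer was judged correct
    behaviors: set[str] = field(default_factory=set)  # labels from an annotator


def select_behavior_primed(trajectories, required=BEHAVIORS):
    """Keep trajectories that exhibit every required behavior.

    Outcome correctness is deliberately ignored here (an assumption of this
    sketch), so a trajectory with a wrong final answer can still be selected.
    """
    return [t for t in trajectories if required <= t.behaviors]


def select_outcome_correct(trajectories):
    """Baseline rule: keep trajectories whose final answer is correct."""
    return [t for t in trajectories if t.correct]
```

Either selected set could then serve as the SFT corpus before RL; in this sketch the two rules differ only in which trajectory attribute they filter on, which mirrors the comparison the abstract draws between reasoning behaviors and outcome correctness.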
