Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches typically train this ability with outcome-based supervision alone, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to strengthen the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, and its ratings correlate strongly with final answer correctness. Moreover, aligning the policy with SRR-Judge-annotated trajectories yields substantial gains: an average absolute pass@1 improvement of more than 10 points across challenging deep search benchmarks.
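The selection step of the rejection sampling fine-tuning procedure can be sketched as follows. This is a minimal illustration under assumed interfaces, not the paper's implementation: `judge` stands in for a hypothetical SRR-Judge step scorer, and `is_correct` for an outcome check on the final answer; only trajectories that pass both the outcome filter and the step-level filter are retained for the next fine-tuning round.

```python
# Sketch of judge-filtered rejection sampling for fine-tuning data selection.
# Hypothetical interfaces: `judge` maps a (thought, action) step to a score in
# [0, 1]; `is_correct` checks the final answer. Neither is from the paper.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    thought: str  # the agent's intermediate reasoning
    action: str   # the search action it issued


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def select_for_sft(
    trajectories: List[Trajectory],
    judge: Callable[[Step], float],
    is_correct: Callable[[str], bool],
    threshold: float = 0.5,
) -> List[Trajectory]:
    """Keep trajectories whose answer is correct AND whose every step
    clears the judge's threshold (the step-level filter that
    outcome-only supervision lacks)."""
    kept = []
    for traj in trajectories:
        if not is_correct(traj.answer):
            continue  # outcome-based filter
        if all(judge(s) >= threshold for s in traj.steps):
            kept.append(traj)  # step-level filter
    return kept
```

Combining both filters is the point: an outcome-only filter would also keep trajectories that reach the right answer through flawed intermediate steps, which the step-level judge screens out.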