Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While most existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: letting the agent self-evolve by iteratively verifying its policy model's outputs against meticulously crafted rubrics. This approach gives rise to inference-time scaling of verification, wherein an agent improves by evaluating its own generated answers to produce iterative feedback and refinements. We derive the rubrics from an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubric-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM-judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference: the verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
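The verify-and-refine loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the rubric names, the `agent_answer` and `verify` stubs, and the acceptance criterion (an empty list of failed rubrics) are all hypothetical placeholders standing in for the policy model and DeepVerifier.

```python
# Hypothetical sketch of inference-time verification scaling: a verifier
# scores an agent's answer against rubrics and returns failed rubrics as
# feedback, which the agent uses to refine its answer -- no training involved.

# Illustrative rubric names (placeholders, not the paper's taxonomy).
RUBRICS = [
    "evidence_grounding",    # claims traceable to retrieved sources
    "reasoning_validity",    # logical steps hold up
    "task_compliance",       # answer addresses the actual question
    "tool_use_correctness",  # tools invoked and interpreted properly
    "answer_formatting",     # output matches the requested format
]

def agent_answer(question, feedback=None):
    """Stub policy model: produces a draft, or refines it given feedback."""
    if feedback is None:
        return {"text": f"draft answer to: {question}", "revisions": 0}
    return {"text": f"refined answer to: {question}",
            "revisions": len(feedback)}

def verify(answer):
    """Stub outcome-reward verifier: returns failed rubrics (empty = pass)."""
    # Toy rule for illustration: the first draft fails one rubric,
    # any revised answer passes.
    return [] if answer["revisions"] > 0 else ["evidence_grounding"]

def self_evolve(question, max_rounds=3):
    """Iteratively bootstrap the answer with verifier feedback at test time."""
    answer = agent_answer(question)
    for _ in range(max_rounds):
        failed = verify(answer)
        if not failed:
            return answer, True  # verifier accepts the answer
        answer = agent_answer(question, feedback=failed)
    return answer, False  # budget exhausted without acceptance

answer, accepted = self_evolve("Who discovered penicillin?")
```

The key design point is that the verifier is a plug-and-play module: swapping in a stronger verifier or a larger refinement budget (`max_rounds`) scales quality at inference time without touching the policy model's weights.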