Deep research agents synthesize long-form reports by searching and reasoning over retrieved evidence. Reinforcement learning with rubric-based rewards improves these agents by optimizing them against checkable criteria that translate report quality into reward signals, but its efficiency depends on whether those criteria reliably capture the task scope and evidence needs. Most existing studies ask an LLM to generate rubrics for a given query; when the model fails to infer the underlying information needs, the generated rubrics may be incomplete, which reduces RL efficiency. To obtain more reliable query--rubric supervision, we introduce DeepRubric, a data construction framework that reverses this process: instead of inferring evaluation criteria for a given query, it first determines what an evidence-backed report should be evaluated on and then synthesizes aligned query--rubric pairs from those evaluation targets. Starting from a sampled seed topic, DeepRubric builds an evidence tree by recursively expanding evidence-backed sub-questions, whose leaves serve as atomic and verifiable evaluation targets. It then uses the evidence tree to synthesize the training query and rubrics, ensuring that the reward evaluates exactly the information requested by the query. Using DeepRubric, we construct 9K query--rubric supervision examples and train DeepRubric-8B with rubric-based GRPO, achieving performance comparable to prior open state-of-the-art deep research models across three benchmarks with roughly 13x fewer RL GPU-hours.
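The construction described above can be illustrated with a minimal sketch. It is not the DeepRubric implementation: the names `EvidenceNode`, `build_evidence_tree`, `rubrics_from_leaves`, and `rubric_reward` are hypothetical, the `expand` callable stands in for whatever search-and-LLM step proposes evidence-backed sub-questions, the `judge` callable stands in for a rubric grader, and scoring a report as the fraction of satisfied rubrics is an assumption made for illustration. Query synthesis from the tree is omitted because it would require a generation model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EvidenceNode:
    """One sub-question in the evidence tree, with the evidence backing it."""
    question: str
    evidence: str = ""
    children: List["EvidenceNode"] = field(default_factory=list)

    def leaves(self) -> List["EvidenceNode"]:
        # A node without children is itself a leaf.
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


# Hypothetical interface: maps a question to (sub_question, evidence) pairs.
Expander = Callable[[str], List[Tuple[str, str]]]


def build_evidence_tree(topic: str, expand: Expander, max_depth: int) -> EvidenceNode:
    """Recursively expand a seed topic, keeping only evidence-backed sub-questions."""
    root = EvidenceNode(question=topic)

    def grow(node: EvidenceNode, depth: int) -> None:
        if depth >= max_depth:
            return
        for sub_question, evidence in expand(node.question):
            if not evidence:  # drop sub-questions with no supporting evidence
                continue
            child = EvidenceNode(sub_question, evidence)
            node.children.append(child)
            grow(child, depth + 1)

    grow(root, 0)
    return root


def rubrics_from_leaves(root: EvidenceNode) -> List[EvidenceNode]:
    """Treat each leaf (other than a childless root) as one evaluation target."""
    return [leaf for leaf in root.leaves() if leaf is not root]


def rubric_reward(report: str, rubrics: List[EvidenceNode],
                  judge: Callable[[str, EvidenceNode], bool]) -> float:
    """Assumed scoring rule: fraction of rubrics the judge marks as satisfied."""
    if not rubrics:
        return 0.0
    return sum(judge(report, r) for r in rubrics) / len(rubrics)


# Toy stand-ins for retrieval and grading; all data here is made up.
TOY_EXPANSIONS = {
    "solid-state batteries": [
        ("What electrolytes are used?", "doc A"),
        ("What limits cycle life?", "doc B"),
        ("An unsupported sub-question", ""),
    ],
    "What electrolytes are used?": [
        ("Which sulfide electrolytes are reported?", "doc C"),
    ],
}


def toy_expand(question: str) -> List[Tuple[str, str]]:
    return TOY_EXPANSIONS.get(question, [])


def toy_judge(report: str, rubric: EvidenceNode) -> bool:
    # Crude check: the report mentions the rubric's evidence source.
    return rubric.evidence in report


tree = build_evidence_tree("solid-state batteries", toy_expand, max_depth=2)
rubrics = rubrics_from_leaves(tree)
reward = rubric_reward("Sulfide electrolytes are surveyed in doc C.", rubrics, toy_judge)
```

In this toy run the tree yields two leaf targets, and a report that covers only one of them receives half of the available reward. Because the rubrics come from the same leaves that would define the query, the reward checks only information the query asks for, which is the alignment property the abstract emphasizes.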