Training and evaluating DeepResearch-generated reports remains challenging due to the lack of verifiable reward signals, so rubric-based evaluation has become common practice. However, existing approaches either rely on coarse, predefined rubrics that lack sufficient granularity, or on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline for training human-preference-aligned, query-specific rubric generators tailored to DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and then train rubric generators via reinforcement learning with a hybrid reward that combines human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We show empirically that our rubric generators provide supervision that is more discriminative and better aligned with human preferences than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on DeepResearch Bench and achieve performance comparable to that of leading closed-source models.