Deep research agents have emerged as powerful systems for addressing complex queries. Meanwhile, LLM-based retrievers have demonstrated strong instruction-following and reasoning capabilities. This raises a critical question: can LLM-based retrievers effectively contribute to deep research agent workflows? To investigate this, we introduce SAGE, a benchmark for scientific literature retrieval comprising 1,200 queries across four scientific domains, with a retrieval corpus of 200,000 papers. We evaluate six deep research agents and find that all of them struggle with reasoning-intensive retrieval. Using DR Tulu as the backbone, we further compare BM25 and LLM-based retrievers (i.e., ReasonIR and gte-Qwen2-7B-instruct) as alternative search tools. Surprisingly, BM25 significantly outperforms the LLM-based retrievers by approximately 30%, because existing agents generate keyword-oriented sub-queries. To improve performance, we propose a corpus-level test-time scaling framework that uses LLMs to augment documents with metadata and keywords, making retrieval easier for off-the-shelf retrievers. This yields gains of 8% and 2% on short-form and open-ended questions, respectively.
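To make the corpus-level augmentation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the LLM-generated metadata and keywords are simply appended to each document's text before indexing, and it uses a small hand-written BM25 scorer in place of a production retriever. The toy papers, the `llm_annotations` dictionary (standing in for real LLM output), and the helper names `augment` and `BM25` are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


class BM25:
    """Small Okapi BM25 scorer over an in-memory list of documents."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(t) for t in self.tokens]
        self.n = len(docs)
        self.avgdl = sum(len(t) for t in self.tokens) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term):
        n_t = self.df.get(term, 0)
        return math.log((self.n - n_t + 0.5) / (n_t + 0.5) + 1.0)

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        # Indices of documents, best score first.
        return sorted(range(self.n), key=lambda i: self.score(query, i), reverse=True)


def augment(doc, metadata, keywords):
    # Assumption: augmentation is plain text concatenation before indexing.
    return f"{doc} {' '.join(metadata.values())} {' '.join(keywords)}"


# Toy corpus (made up for illustration).
papers = [
    "A survey of protein folding methods.",
    "We study attention layers for long sequences.",
]

# Hypothetical stand-in for LLM output; a real system would call a model here.
llm_annotations = [
    {"metadata": {"field": "biology"}, "keywords": ["structure", "prediction"]},
    {"metadata": {"field": "nlp"}, "keywords": ["transformer", "efficient", "inference"]},
]

augmented = [
    augment(doc, ann["metadata"], ann["keywords"])
    for doc, ann in zip(papers, llm_annotations)
]

query = "transformer efficient inference"  # keyword-style sub-query
raw_index = BM25(papers)
aug_index = BM25(augmented)

raw_scores = [raw_index.score(query, i) for i in range(len(papers))]
aug_scores = [aug_index.score(query, i) for i in range(len(papers))]
print("raw scores:", raw_scores)
print("augmented scores:", aug_scores)
print("augmented ranking:", aug_index.rank(query))
```

In this toy setup the keyword-style query shares no terms with the raw abstracts, so the lexical retriever can only separate the two papers after the appended keywords are indexed. The cost of generating annotations is paid once per document rather than once per query, which is one plausible reading of "corpus-level" scaling in the abstract.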