We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to assess whether human reference lists are a sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86--88% for the strongest AI-based re-rankers. We examine this gap on the OpenAlex co-authorship graph and find that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
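To make the two quantities behind the first result concrete, here is a minimal sketch of breadth-first bibliography expansion followed by a recall computation against a human reference list. It is an illustration under stated assumptions, not the pipeline evaluated here: the function names, the `max_depth` cutoff, and the toy citation graph are all hypothetical, and the full-paper query processing and search-API calls are omitted.

```python
from collections import deque


def expand_breadth_first(seeds, bibliography, max_depth):
    """Collect every paper reachable from `seeds` within `max_depth` citation hops.

    `bibliography` maps a paper id to the ids it cites (hypothetical toy format).
    """
    seen = set(seeds)
    queue = deque((paper, 0) for paper in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow bibliographies beyond the depth cutoff
        for cited in bibliography.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen


def recall(retrieved, references):
    """Fraction of the human reference list found among the retrieved papers."""
    references = set(references)
    if not references:
        return 0.0
    return len(references & set(retrieved)) / len(references)


# Toy data, invented for illustration only.
bibliography = {"s1": ["a", "b"], "s2": ["b", "c"], "a": ["d"], "c": ["e"]}
seeds = ["s1", "s2"]                     # stand-in for initial search hits
references = ["a", "c", "d", "e", "f"]   # stand-in for the human reference list

baseline_recall = recall(seeds, references)
expanded = expand_breadth_first(seeds, bibliography, max_depth=2)
expanded_recall = recall(expanded, references)
print(f"seed-only recall: {baseline_recall:.2f}, after expansion: {expanded_recall:.2f}")
```

On this toy graph the seed hits alone recover none of the references, while two hops of expansion recover four of the five. The one never found, `f`, is cited by no retrieved paper, which shows that expansion can only reach what the retrieved bibliographies contain.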