Scientific paper retrieval, particularly when framed as document-to-document retrieval, aims to identify papers relevant to a long-form query paper rather than to a short query string. Previous approaches to this task have focused exclusively on abstracts, embedding them into dense vectors as surrogates for full documents and computing similarity between them. Yet abstracts offer only sparse, high-level summaries, and such methods primarily optimize one-to-one similarity, overlooking the dynamic relations that emerge across relevant papers during the retrieval process. To address this, we propose Chain of Retrieval (COR), a novel iterative framework for full-paper retrieval. Specifically, COR decomposes each query paper into multiple aspect-specific views, matches them against segmented candidate papers, and iteratively expands the search by promoting top-ranked results as new queries, thereby forming a tree-structured retrieval process. The resulting retrieval tree is then aggregated in a post-order manner: descendants are first combined at the query level, then recursively merged with their parent nodes, to capture hierarchical relations across iterations. To validate this approach, we present SCIFULLBENCH, a large-scale benchmark providing both complete and segmented contexts of full papers for queries and candidates. Results show that COR significantly outperforms existing retrieval baselines. Our code and dataset are available at https://github.com/psw0021/Chain-of-Retrieval-Official.
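The iterative expansion and post-order aggregation described in the abstract can be illustrated with a small toy sketch. This is not the released COR implementation. Word-overlap (Jaccard) similarity stands in for dense embeddings, a paper's segments stand in for its aspect-specific views, and the names `retrieve`, `build_tree`, `aggregate`, and the `decay` weight are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Toy similarity: word overlap. A stand-in for embedding similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def score_candidates(query_views, corpus, exclude):
    """Score each candidate paper by its best view-to-segment match."""
    scores = {}
    for pid, segments in corpus.items():
        if pid in exclude:
            continue
        scores[pid] = max(
            (jaccard(v, s) for v in query_views for s in segments),
            default=0.0,
        )
    return scores


@dataclass
class Node:
    paper_id: str
    scores: dict                      # candidate id -> score for this query
    children: list = field(default_factory=list)


def build_tree(paper_id, views, corpus, depth, top_k, visited):
    """Retrieve for one query, then promote its top-k results to new queries."""
    scores = score_candidates(views, corpus, exclude=visited)
    node = Node(paper_id, scores)
    if depth > 1:
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        for pid in top:
            child = build_tree(pid, corpus[pid], corpus,
                               depth - 1, top_k, visited | {pid})
            node.children.append(child)
    return node


def aggregate(node, decay=0.5):
    """Post-order merge: combine the children first, then fold into the parent.

    Averaging the children and down-weighting them by `decay` is an assumed
    rule chosen for this sketch.
    """
    merged = dict(node.scores)
    if not node.children:
        return merged
    child_total = {}
    for child in node.children:
        for pid, s in aggregate(child, decay).items():
            child_total[pid] = child_total.get(pid, 0.0) + s
    for pid, s in child_total.items():
        merged[pid] = merged.get(pid, 0.0) + decay * s / len(node.children)
    return merged


def retrieve(query_id, query_views, corpus, depth=2, top_k=1):
    """Build the retrieval tree and return candidates ranked by merged score."""
    root = build_tree(query_id, query_views, corpus, depth, top_k, {query_id})
    final = aggregate(root)
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    corpus = {
        "A": ["graph neural networks", "citation ranking"],
        "B": ["citation ranking models"],
        "C": ["protein folding"],
    }
    # "B" shares no words with the query and is reached only through "A".
    for pid, score in retrieve("q", ["graph neural networks for retrieval"], corpus):
        print(pid, round(score, 3))
```

In this toy corpus, a single round of retrieval gives paper "B" a score of zero. It gains a nonzero score only once the top-ranked paper "A" is promoted to a query, which is the kind of cross-paper relation that one-to-one similarity misses.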