With the rapid growth of Web-based academic publishing, the number of papers published each year continues to rise, making it increasingly difficult to find relevant prior work. Citation prediction aims to automatically suggest appropriate references, helping scholars navigate the expanding scientific literature. We present \textbf{CiteRAG}, the first comprehensive retrieval-augmented generation (RAG)-integrated benchmark for evaluating large language models on academic citation prediction, featuring a multi-level retrieval strategy, specialized retrievers, and specialized generators. Our benchmark makes four core contributions: (1) We formalize two instances of the citation prediction task at different granularities: Task 1 targets coarse-grained, list-level citation prediction, while Task 2 targets fine-grained, position-specific citation prediction. To support these tasks, we build a dataset of 7,267 instances for Task 1 and 8,541 instances for Task 2, enabling comprehensive evaluation of both retrieval and generation. (2) We construct a three-level, large-scale corpus of 554k papers spanning many major subfields, built with an incremental collection pipeline. (3) We propose a multi-level hybrid RAG approach for citation prediction, fine-tuning embedding models with contrastive learning to capture complex citation relationships and pairing them with specialized generation models. (4) We conduct extensive experiments across state-of-the-art language models, including closed-source APIs, open-source models, and our fine-tuned generators, demonstrating the effectiveness of our framework. Our open-source toolkit enables reproducible evaluation of academic citation prediction, providing the first comprehensive evaluation framework for this task and serving as a methodological template for other scientific domains. Our source code and data are released at https://github.com/LQgdwind/CiteRAG.
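The abstract mentions fine-tuning embedding models with contrastive learning for citation retrieval. As a minimal illustrative sketch (not the paper's actual objective, whose loss, temperature, and negative-sampling scheme are not specified here), an InfoNCE-style contrastive loss over a citing context, one cited paper (positive), and several non-cited papers (negatives) can be computed as:

```python
import math

def info_nce_loss(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE-style contrastive loss for one (citing context, cited paper) pair.

    sim_pos:  similarity between the citing-context embedding and a true reference
    sim_negs: similarities to non-cited (negative) papers
    temperature: illustrative value; the benchmark's setting may differ
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    # negative log-probability of ranking the true reference first
    return -(logits[0] - log_denom)
```

Minimizing this loss pulls citing contexts toward the papers they actually cite and pushes them away from sampled negatives, which is the standard mechanism by which contrastive fine-tuning teaches an embedding model citation relationships.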