The rapid growth of Web-based academic publishing makes it increasingly difficult for scholars to find relevant prior work. Citation prediction aims to automatically suggest appropriate references, helping researchers navigate the expanding scientific literature. We present \textbf{CiteRAG}, the first comprehensive retrieval-augmented generation (RAG)-integrated benchmark for evaluating large language models on academic citation prediction, featuring a multi-level retrieval strategy, specialized retrievers, and specialized generators. Our benchmark makes four core contributions. (1) We establish two instances of the citation prediction task at different granularities: Task 1 addresses coarse-grained, list-level citation prediction, while Task 2 targets fine-grained, position-specific citation prediction. To support these tasks, we build a dataset of 7,267 instances for Task 1 and 8,541 instances for Task 2, enabling comprehensive evaluation of both retrieval and generation. (2) We construct a three-level, large-scale corpus of 554k papers spanning many major subfields, built with an incremental pipeline. (3) We propose a multi-level hybrid RAG approach for citation prediction, fine-tuning embedding models with contrastive learning to capture complex citation relationships and pairing them with specialized generation models. (4) We conduct extensive experiments across state-of-the-art language models, including closed-source APIs, open-source models, and our fine-tuned generators, demonstrating the effectiveness of our framework. Our open-source toolkit enables reproducible evaluation, providing the first comprehensive evaluation framework for citation prediction in the academic literature and serving as a methodological template for other scientific domains. Our source code and data are released at https://github.com/LQgdwind/CiteRAG.
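The abstract mentions fine-tuning embedding models with contrastive learning to capture citation relationships, but does not spell out the training objective. A common choice for such retriever fine-tuning is an in-batch InfoNCE loss over (citing-context, cited-paper) embedding pairs; the sketch below illustrates that standard objective and is an assumption, not the paper's confirmed implementation (the function name, batch construction, and temperature value are all hypothetical).

```python
import numpy as np

def info_nce_loss(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive (InfoNCE) loss for retriever fine-tuning.

    query_emb[i] embeds a citing context; doc_emb[i] embeds the paper it
    actually cites (the positive). All other docs in the batch serve as
    in-batch negatives. Shapes: (B, dim) each.
    """
    # L2-normalize so dot products are cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature          # (B, B) similarity matrix
    # Row-wise softmax cross-entropy; positives sit on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return float(-log_probs[idx, idx].mean())
```

In a real training loop this loss would be computed on the embedding model's outputs and backpropagated (e.g. with PyTorch); the NumPy version here only shows the math. The loss is small when each context is most similar to its own cited paper and large when similarities are shuffled.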