The explosive growth of AI and machine learning literature, with venues like NeurIPS and ICLR now accepting thousands of papers annually, has made comprehensive citation coverage increasingly difficult for researchers. While citation recommendation has been studied for over a decade, existing systems primarily target broad relevance rather than the critical set of ``must-cite'' papers: direct experimental baselines, foundational methods, and core dependencies whose omission would misrepresent a contribution's novelty or undermine reproducibility. We introduce MasterSet, a large-scale benchmark specifically designed to evaluate must-cite recommendation in the AI/ML domain. MasterSet incorporates over 150,000 papers collected from the official conference proceedings and websites of 15 leading venues, which together serve as a comprehensive candidate pool for retrieval. We annotate citations with a three-tier labeling scheme: (I) experimental baseline status, (II) core relevance (1--5 scale), and (III) intra-paper mention frequency. Our annotation pipeline uses an LLM-based judge, validated by human experts on a stratified sample. The benchmark task is to retrieve must-cite papers from the candidate pool given only a query paper's title and abstract, with performance measured by Recall@$K$. We establish baselines using sparse retrieval, dense scientific embeddings, and graph-based methods, and the results show that must-cite retrieval remains a challenging open problem.
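To make the evaluation metric concrete, the following is a minimal sketch of Recall@$K$ for a single query paper, under the standard definition (the fraction of the must-cite set that appears among the top $K$ retrieved candidates). The function name `recall_at_k`, the paper identifiers, and the convention of returning 0.0 for an empty must-cite set are illustrative assumptions, not details taken from MasterSet.

```python
from typing import Hashable, Iterable, Sequence


def recall_at_k(
    ranked: Sequence[Hashable],
    must_cite: Iterable[Hashable],
    k: int,
) -> float:
    """Fraction of the must-cite set found in the top-k ranked candidates."""
    if k < 0:
        raise ValueError("k must be non-negative")
    relevant = set(must_cite)
    if not relevant:
        # Assumed convention: the source does not say how empty sets are handled.
        return 0.0
    hits = relevant.intersection(ranked[:k])
    return len(hits) / len(relevant)


# Hypothetical paper IDs, for illustration only.
ranked = ["p7", "p2", "p9", "p4", "p1"]   # retrieval output, best first
must_cite = {"p2", "p4", "p8"}            # ground-truth must-cite set

r3 = recall_at_k(ranked, must_cite, 3)    # only p2 is in the top 3
r5 = recall_at_k(ranked, must_cite, 5)    # p2 and p4 are in the top 5
```

A benchmark-level score would typically average this per-query value over all query papers. Whether MasterSet uses that macro-average or another aggregation is not stated in the abstract.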