The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each comprising a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys along four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods reveals a substantial gap between generated and expert-written surveys, showing that even advanced agentic frameworks struggle with the complexities of survey generation and underscoring the need for further research in this area. We have open-sourced all code, data, and models at: https://github.com/oneal2000/SurGE