Writing a survey paper on a research topic usually requires covering the salient content of numerous related papers, which can be modeled as a multi-document summarization (MDS) task. Existing MDS datasets usually focus on producing unstructured summaries that cover only a few input documents. Meanwhile, previous work on structured summary generation focuses on summarizing a single document into a multi-section summary. These existing datasets and methods cannot meet the requirements of summarizing numerous academic papers into a structured summary. To address the scarcity of available data, we propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic. We collect target summaries from more than seven thousand survey papers and use the abstracts of their 430 thousand reference papers as input documents. To organize the diverse content from dozens of input documents and to process long text sequences efficiently, we propose a summarization method named category-based alignment and sparse transformer (CAST). Experimental results show that CAST outperforms various advanced summarization methods.