Generating a Structured Summary of Numerous Academic Papers: Dataset and Method

Writing a survey paper on a research topic usually requires covering the salient content of numerous related papers, which can be modeled as a multi-document summarization (MDS) task. Existing MDS datasets usually focus on producing unstructured summaries that cover only a few input documents. Meanwhile, previous work on structured summary generation focuses on summarizing a single document into a multi-section summary. These existing datasets and methods cannot meet the requirements of summarizing numerous academic papers into a structured summary. To address the scarcity of available data, we propose BigSurvey, the first large-scale dataset for generating comprehensive summaries of numerous academic papers on each topic. We collect target summaries from more than seven thousand survey papers and use the abstracts of their 430 thousand reference papers as input documents. To organize the diverse content from dozens of input documents and to process long text sequences efficiently, we propose a summarization method named category-based alignment and sparse transformer (CAST). The experimental results show that our CAST method outperforms various advanced summarization methods.

