Multi-document summarization (MDS) is the task of condensing the text of multiple documents into a concise summary. The generated summary saves the reader the time of going through many documents by presenting the important content in a few sentences. Abstractive MDS aims to generate a coherent and fluent summary for multiple documents using natural language generation techniques. In this paper, we consider the unsupervised abstractive MDS setting, where only documents are available and no ground-truth summaries are provided, and we propose Absformer, a new Transformer-based method for unsupervised abstractive summary generation. Our method consists of two steps: first, we pretrain a Transformer-based encoder with the masked language modeling (MLM) objective and use it to cluster the documents into semantically similar groups; second, we train a Transformer-based decoder to generate abstractive summaries for the clusters of documents. To our knowledge, we are the first to successfully apply a Transformer-based model to the unsupervised abstractive MDS task. We evaluate our approach on three real-world datasets from different domains, and we demonstrate both substantial improvements in evaluation metrics over state-of-the-art abstractive methods and generalization to datasets from different domains.
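To make the first step concrete, the following is a minimal sketch of its two ingredients: corrupting a token sequence for an MLM-style objective, and grouping document embeddings into clusters. It is an illustration under stated assumptions, not the Absformer implementation. The abstract does not name a masking rate or a clustering algorithm, so the 15% rate, the `[MASK]` symbol, k-means with farthest-point initialization, and the function names `mask_tokens` and `kmeans` are all hypothetical choices. The toy two-dimensional vectors stand in for encoder outputs, and the decoder step is not sketched.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, not taken from the paper


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Corrupt a token list for an MLM-style objective.

    Returns (corrupted, labels), where labels[i] is the original token
    at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iters=20):
    """Plain k-means; returns one cluster id per input vector.

    Initialization is deterministic: the first vector is the first
    centroid, and each further centroid is the vector farthest from
    the centroids chosen so far.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        assign = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old centroid if the cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


# Toy stand-ins for encoder outputs: two well-separated groups.
doc_embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
clusters = kmeans(doc_embeddings, k=2)
corrupted, labels = mask_tokens("the cat sat on the mat".split())
```

In a full pipeline, the embeddings would come from the pretrained encoder rather than being written by hand, and each resulting cluster would be passed to the decoder for summary generation.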