We investigate pre-training techniques for abstractive multi-document summarization (MDS), which is much less studied than single-document summarization. Although recent work has demonstrated that highlighting information salience is effective for pre-training strategy design, such approaches struggle to generate abstractive and reflective summaries, both critical properties for MDS. To this end, we present PELMS, a pre-trained model whose objectives are based on semantic coherence heuristics and faithfulness constraints over unlabeled multi-document inputs, promoting the generation of concise, fluent, and faithful summaries. To support the training of PELMS, we compile MultiPT, a multi-document pre-training corpus in which over 93 million documents form more than 3 million unlabeled topic-centric document clusters, covering diverse genres such as product reviews, news, and general knowledge. We perform extensive evaluation of PELMS in low-shot settings on a wide range of MDS datasets. Our approach consistently outperforms competitive baselines with respect to overall informativeness, abstractiveness, coherence, and faithfulness.