In Multi-Document Summarization (MDS), the input can be modeled as a set of documents, and the output is its summary. In this paper, we focus on pretraining objectives for MDS. Specifically, we introduce a novel pretraining objective, which involves selecting the ROUGE-based centroid of each document cluster as a proxy for its summary. Our objective thus does not require human-written summaries and can be used for pretraining on a dataset consisting solely of document sets. Through zero-shot, few-shot, and fully supervised experiments on multiple MDS datasets, we show that our model, Centrum, performs better than or comparably to a state-of-the-art model. We make the pretrained and fine-tuned models freely available to the research community at https://github.com/ratishsp/centrum.
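The centroid selection described above can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: ROUGE is approximated here by a simplified unigram-overlap F1 (not an official ROUGE implementation), the centroid is taken to be the document with the highest mean score against the other documents in its cluster, and the remaining documents are used as the model input. The function names `rouge1_f1`, `select_centroid`, and `make_pretraining_pair` are hypothetical and do not come from the released code.

```python
from collections import Counter
from typing import List, Tuple


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens.

    Assumption: a stand-in for a full ROUGE implementation.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_centroid(cluster: List[str]) -> int:
    """Return the index of the document with the highest mean score
    against all other documents in the cluster (assumed centroid definition)."""
    if len(cluster) < 2:
        raise ValueError("a cluster needs at least two documents")

    def mean_score(i: int) -> float:
        others = [doc for j, doc in enumerate(cluster) if j != i]
        return sum(rouge1_f1(cluster[i], doc) for doc in others) / len(others)

    return max(range(len(cluster)), key=mean_score)


def make_pretraining_pair(cluster: List[str]) -> Tuple[List[str], str]:
    """Build one (input documents, pseudo-summary) pair from a cluster.

    Assumption: the centroid is held out as the target and the
    remaining documents form the input.
    """
    idx = select_centroid(cluster)
    return cluster[:idx] + cluster[idx + 1:], cluster[idx]


cluster = [
    "storm hits coast",
    "storm hits coast and floods city",
    "floods city streets",
]
inputs, target = make_pretraining_pair(cluster)
```

In this toy cluster the second document overlaps with both of the others, so it is chosen as the pseudo-summary and the other two become the input. No human-written summary is involved at any point, which is the property the objective relies on.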