Abstractive multi-document summarization (MDS) is the task of automatically summarizing information in multiple documents, ranging from news articles to conversations with multiple speakers. The training approaches for current MDS models fall into four categories: end-to-end with special pre-training ("direct"), chunk-then-summarize, extract-then-summarize, and inference with GPT-style models. In this work, we evaluate MDS models across training approaches, domains, and dimensions (reference similarity, quality, and factuality) to analyze how and why models trained on one domain fail to summarize documents from another (News, Science, and Conversation) in the zero-shot domain transfer setting. We define domain-transfer "failure" as a decrease in factuality, higher deviation from the target, and a general decrease in summary quality. In addition to exploring domain transfer for MDS models, we examine potential issues with applying popular summarization metrics out-of-the-box.