Recent pre-trained language models (PLMs) achieve promising results on existing abstractive summarization datasets. However, existing summarization benchmarks overlap in time with the standard pre-training corpora and fine-tuning datasets. Hence, the strong performance of PLMs may rely on parametric knowledge memorized during pre-training and fine-tuning. Moreover, this memorized knowledge may quickly become outdated, which affects the generalization performance of PLMs on future data. In this work, we propose TempoSum, a novel benchmark that contains data samples from 2010 to 2022, to understand the temporal generalization ability of abstractive summarization models. Through extensive human evaluation, we show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data. In addition, existing faithfulness enhancement methods cannot reliably improve the faithfulness of summarization models on future data. Finally, we discuss several recommendations for the research community on how to evaluate and improve the temporal generalization capability of text summarization models.