Recent pre-trained language models (PLMs) achieve promising results on existing abstractive summarization datasets. However, existing summarization benchmarks overlap in time with the standard pre-training corpora and fine-tuning datasets. Hence, the strong performance of PLMs may rely on parametric knowledge memorized during pre-training and fine-tuning. Moreover, this memorized knowledge may quickly become outdated, which affects the generalization performance of PLMs on future data. In this work, we propose TempoSum, a novel benchmark that contains data samples from 2010 to 2022, to understand the temporal generalization ability of abstractive summarization models. Through extensive human evaluation, we show that parametric knowledge stored in summarization models significantly affects the faithfulness of the generated summaries on future data. Moreover, existing faithfulness enhancement methods cannot reliably improve the faithfulness of summarization models on future data. Finally, we discuss several recommendations for the research community on how to evaluate and improve the temporal generalization capability of text summarization models.