Abstractive summarization models are typically evaluated on test data drawn from the same distribution as the training data. In real-world practice, documents to be summarized may contain input noise caused by text extraction artifacts or data pipeline bugs. The robustness of model performance under the distribution shift caused by such noise is relatively under-studied. We present a large empirical study quantifying the sometimes severe loss in performance (up to 12 ROUGE-1 points) from different types of input noise across a range of datasets and model sizes. We then propose a lightweight method for detecting and removing such noise in the input during model inference, without requiring any extra training, auxiliary models, or even prior knowledge of the type of noise. Our proposed approach effectively mitigates the loss in performance, recovering a large fraction of the drop, sometimes as much as 11 ROUGE-1 points.