Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, in which evaluation datasets leak into the pretraining corpus and inflate measured performance. Decontamination, the process of detecting and removing such data, is a potential solution; yet contaminants may originate from altered versions of the test set, evading detection during decontamination. How different types of contamination impact the performance of language models on downstream tasks is not fully understood. We present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks -- summarization and question answering -- revealing how different types of contamination influence task performance during evaluation.
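The abstract does not specify a detection method, but a common decontamination baseline is exact n-gram overlap between pretraining documents and test examples. A minimal sketch, assuming whitespace tokenization and an illustrative n-gram size (both choices are ours, not the paper's):

```python
# Illustrative n-gram overlap decontamination check (not the paper's method).
# A pretraining document is flagged as contaminated if it shares any
# word-level n-gram with a test-set example.

def ngrams(text: str, n: int) -> set:
    """Return the set of lowercased word-level n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document: str, test_examples: list, n: int = 8) -> bool:
    """Flag `document` if it shares at least one n-gram with any test example."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(example, n) for example in test_examples)
```

Note that such exact-match filters are exactly what the abstract's "altered versions of the test set" evade: a paraphrased or reformatted copy of a test example shares few or no long n-grams with the original, so it passes the filter while still contaminating the corpus.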