Data contamination in evaluation is increasingly prevalent with the emergence of language models pre-trained on very large, automatically crawled corpora. This problem poses significant challenges for accurately assessing model capabilities and generalisation. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by using only texts published within a recent time window, which ensures no overlap with the training corpora of pre-trained language models. We develop the LatestEval automated pipeline to 1) gather the latest texts; 2) identify key information; and 3) construct questions targeting that information while removing the existing answers from the context. This encourages models to infer the answers themselves from the remaining context, rather than simply copying them from the passage. Our experiments demonstrate that language models exhibit negligible memorisation behaviours on LatestEval compared with previous benchmarks, suggesting a significantly reduced risk of data contamination and leading to a more robust evaluation. Data and code are publicly available at: https://github.com/liyucheng09/LatestEval.
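The three pipeline steps can be illustrated with a minimal sketch. It is not the LatestEval implementation (see the repository for that): the `Document` type, the 30-day window, the sentence splitter, the "longest sentence" heuristic for key information, and the question template are all hypothetical stand-ins chosen only to make the data flow concrete.

```python
import re
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    """Hypothetical container for a crawled text and its publication date."""
    text: str
    published: date


def filter_recent(docs, today, window_days=30):
    """Step 1: keep only texts published inside the recent time window.

    The 30-day default is an assumption for illustration.
    """
    cutoff = today - timedelta(days=window_days)
    return [d for d in docs if cutoff <= d.published <= today]


def split_sentences(text):
    """Naive sentence splitter: break on whitespace after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pick_key_index(sentences):
    """Step 2 (stand-in): treat the longest sentence as the key information."""
    return max(range(len(sentences)), key=lambda i: len(sentences[i]))


def make_example(doc):
    """Step 3: build a question about the key sentence and remove that
    sentence from the context, so the answer cannot be copied verbatim."""
    sentences = split_sentences(doc.text)
    key = pick_key_index(sentences)
    context = " ".join(s for i, s in enumerate(sentences) if i != key)
    # Placeholder template; a real system would generate a targeted question.
    question = "What information is missing from the passage?"
    return {"context": context, "question": question, "answer": sentences[key]}


if __name__ == "__main__":
    docs = [
        Document("Old news. It was long ago.", date(2020, 1, 1)),
        Document(
            "The council met on Monday. It approved a new budget of ten "
            "million pounds for road repairs. Members left early.",
            date(2024, 1, 10),
        ),
    ]
    for doc in filter_recent(docs, today=date(2024, 1, 15)):
        print(make_example(doc))
```

Removing the key sentence by index, rather than by string match, keeps the remaining context intact even if the same sentence happens to appear twice in a text.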