Data contamination in evaluation is increasingly prevalent as language models are pre-trained on very large, automatically crawled corpora. This makes it difficult to assess model capabilities and generalisation accurately. In this paper, we propose LatestEval, an automatic method that leverages the most recent texts to create uncontaminated reading comprehension evaluations. LatestEval avoids data contamination by using only texts published within a recent time window, which ensures no overlap with the training corpora of pre-trained language models. We develop an automated LatestEval pipeline to 1) gather the latest texts; 2) identify key information; and 3) construct questions targeting that information while removing the existing answers from the context. This encourages models to infer the answers from the remaining context rather than simply copy them. Our experiments demonstrate that language models exhibit negligible memorisation behaviour on LatestEval compared with previous benchmarks, suggesting a significantly reduced risk of data contamination and a more robust evaluation. Data and code are publicly available at: https://github.com/liyucheng09/LatestEval.
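The two mechanical parts of the pipeline, restricting the corpus to a recent time window and removing the answer from the context, can be illustrated with a minimal sketch. This is not the released LatestEval code: the names `Document`, `recent_documents`, `mask_answer`, and `build_example`, the 30-day window, and the `[MASKED]` placeholder are all assumptions made for illustration, and the key-information step is stood in for by a hand-supplied answer span.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Document:
    # Hypothetical container; the real pipeline's data model may differ.
    text: str
    published: date


def recent_documents(docs, today, window_days=30):
    """Keep only documents published within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return [d for d in docs if cutoff <= d.published <= today]


def mask_answer(context, answer, placeholder="[MASKED]"):
    """Replace the answer span in the context so it cannot be copied verbatim."""
    if answer not in context:
        raise ValueError("answer span not found in context")
    return context.replace(answer, placeholder)


def build_example(doc, answer, question):
    """Pair a question with a context from which the answer has been removed."""
    return {
        "context": mask_answer(doc.text, answer),
        "question": question,
        "answer": answer,
    }


# Usage: the answer span is supplied by hand here (assumption); an automated
# pipeline would have to identify it in the key-information step.
docs = [
    Document("The committee approved the budget on Tuesday.", date(2024, 1, 20)),
    Document("An older report on the same topic.", date(2022, 5, 1)),
]
fresh = recent_documents(docs, today=date(2024, 1, 31))
example = build_example(
    fresh[0],
    answer="on Tuesday",
    question="When did the committee approve the budget?",
)
```

Only the first document falls inside the window, and the resulting context no longer contains the answer string, so a model has to infer the answer from the remaining text rather than copy it.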