Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potentially major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination in individual instances drawn from a small random sample; using this information, it then assesses whether an entire dataset partition is contaminated. To estimate contamination of individual instances, we employ a "guided instruction": a prompt consisting of the dataset name, partition type, and the initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or closely matches the latter segment of the reference. To determine whether an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE or BLEURT) is statistically significantly better with the guided instruction than with a general instruction that does not include the dataset and partition name. The second idea marks a dataset as contaminated if a classifier based on GPT-4 with in-context learning prompting marks multiple instances as contaminated. Our best method achieves an accuracy between 92% and 100% in detecting whether an LLM is contaminated across seven datasets, each containing train and test/validation partitions, when compared against manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
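The first partition-level idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the prompt wording, the function names, the word-level ROUGE-L computation, the paired bootstrap test, and the 0.05 threshold are all assumptions made here, since the abstract names neither the exact prompts nor the significance test. BLEURT is left out because it needs a learned model.

```python
import random


def guided_prompt(dataset, split, first_piece):
    # Hypothetical wording: names the dataset and the partition.
    return (
        f"Below is the first piece of an instance from the {split} split "
        f"of the {dataset} dataset. Complete the second piece exactly as it "
        f"appears in the dataset.\n\nFirst piece: {first_piece}\nSecond piece:"
    )


def general_prompt(first_piece):
    # Hypothetical wording: same task, with no dataset or partition name.
    return (
        "Complete the second piece so that the two pieces form a single "
        f"instance.\n\nFirst piece: {first_piece}\nSecond piece:"
    )


def lcs_length(a, b):
    # Longest common subsequence length via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    # Word-level ROUGE-L F1 on lowercased, whitespace-split tokens.
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def bootstrap_p_value(guided_scores, general_scores, n_resamples=10_000, seed=0):
    # One-sided paired bootstrap (an assumed choice of test): the fraction of
    # resamples in which the guided prompt is NOT better on average.
    rng = random.Random(seed)
    diffs = [g - b for g, b in zip(guided_scores, general_scores)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            not_better += 1
    return not_better / n_resamples


def partition_flagged(guided_scores, general_scores, alpha=0.05):
    # Flag the partition when the guided prompt scores significantly higher.
    return bootstrap_p_value(guided_scores, general_scores) < alpha
```

In use, each sampled instance would be split into a first piece and a reference second piece, both prompts would be sent to the LLM under test, and each completion would be scored against the reference with `rouge_l_f1`; the two resulting score lists are then passed to `partition_flagged`.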
