Time Travel in LLMs: Tracing Data Contamination in Large Language Models

Data contamination, i.e., the presence of test data from downstream tasks in the training data of large language models (LLMs), is a potentially major issue in measuring LLMs' real effectiveness on other tasks. We propose a straightforward yet effective method for identifying data contamination within LLMs. At its core, our approach starts by identifying potential contamination at the instance level; using this information, it then assesses wider contamination at the partition level. To estimate contamination of individual instances, we employ "guided instruction": a prompt consisting of the dataset name, partition type, and a random-length initial segment of a reference instance, asking the LLM to complete it. An instance is flagged as contaminated if the LLM's output either exactly or nearly matches the latter segment of the reference. To determine whether an entire partition is contaminated, we propose two ideas. The first idea marks a dataset partition as contaminated if the average overlap score with the reference instances (as measured by ROUGE-L or BLEURT) is statistically significantly better with the completions from guided instruction than with those from a "general instruction" that does not include the dataset and partition name. The second idea marks a dataset partition as contaminated if a classifier based on GPT-4 with a few-shot in-context learning prompt marks multiple generated completions as exact or near-exact matches of the corresponding reference instances. Our best method achieves an accuracy between 92% and 100% in detecting whether an LLM is contaminated with seven datasets, containing train and test/validation partitions, when contrasted with manual evaluation by human experts. Further, our findings indicate that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets.
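The first partition-level idea can be illustrated with a minimal Python sketch. Several parts are assumptions rather than details given in the abstract: the prompt wording in `build_prompts` is hypothetical, `rouge_l_f1` is a simplified whitespace-token ROUGE-L (no stemming), and the one-sided paired bootstrap with `alpha=0.05` is just one possible way to test for a statistically significant difference. The LLM call itself is omitted; the sketch assumes per-instance overlap scores for both prompt types have already been computed.

```python
import random


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L F1 on whitespace tokens (assumption: no stemming)."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def build_prompts(dataset, partition, first_piece):
    """Hypothetical wording: only the guided prompt names the dataset and partition."""
    guided = (
        f"You are given the first piece of an instance from the {partition} "
        f"split of the {dataset} dataset. Finish the second piece exactly as "
        f"it appears in the dataset.\nFirst piece: {first_piece}\nSecond piece:"
    )
    general = (
        "Finish the second piece based on the first piece so that they form "
        f"one instance.\nFirst piece: {first_piece}\nSecond piece:"
    )
    return guided, general


def bootstrap_p_value(guided_scores, general_scores, n_resamples=10_000, seed=0):
    """One-sided paired bootstrap (an assumed test): the fraction of resamples
    in which the mean guided-minus-general score difference is not positive."""
    rng = random.Random(seed)
    diffs = [g - b for g, b in zip(guided_scores, general_scores)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        sample_mean = sum(rng.choice(diffs) for _ in range(n)) / n
        if sample_mean <= 0:
            not_better += 1
    return not_better / n_resamples


def partition_flagged(guided_scores, general_scores, alpha=0.05):
    """Flag the partition when guided completions overlap significantly more."""
    return bootstrap_p_value(guided_scores, general_scores) < alpha
```

In use, each sampled instance would be split at a random point, both prompts sent to the model, and each completion scored against the true second piece with `rouge_l_f1`; the two score lists then go to `partition_flagged`.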

