We propose the Data Contamination Quiz (DCQ), a simple and effective approach to detecting data contamination in large language models (LLMs) and estimating its extent. Specifically, we frame data contamination detection as a series of multiple-choice questions. For each instance subsampled from a specific dataset partition (e.g., the GSM8k test set), we create three perturbed versions that differ from the original only through word-level changes. These perturbations, together with the original dataset instance, form the options in the DCQ, and an additional option allows the model to select none of the provided options. Because the only signal distinguishing the options is their exact wording relative to the original dataset instance, an LLM asked to identify the original tends to select it if it was exposed to that instance during pre-training, a trait intrinsic to LLMs. With positional biases in LLMs accounted for, quiz performance reveals the contamination level of the examined model for the dataset partition that the quiz covers. Applied to various datasets with GPT-4 and GPT-3.5, and without any access to pre-training data or model parameters, our findings suggest that DCQ achieves state-of-the-art results, uncovers greater contamination/memorization levels than existing methods, and bypasses more safety filters, especially those designed to prevent the generation of copyrighted content.
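To make the quiz format concrete, the following is a minimal Python sketch of how a single DCQ item could be assembled and scored. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the prompt wording, how the perturbations are generated, or how positional bias is corrected. Here the three perturbed versions are assumed to be supplied by the caller, positional bias is handled by placing the original in each of the four slots in turn, and all names (`QuizItem`, `build_quiz_item`, `all_positions`, `format_prompt`, `selection_rate`) are hypothetical.

```python
from dataclasses import dataclass

LETTERS = ["A", "B", "C", "D"]
# Assumed wording for the fifth option; the abstract only says it exists.
NONE_OPTION = "None of the provided options."


@dataclass
class QuizItem:
    options: dict  # option letter -> option text
    answer: str    # letter that holds the original dataset instance


def build_quiz_item(original, perturbations, original_slot):
    """Place the original at `original_slot` (0-3) among three perturbations."""
    if len(perturbations) != 3:
        raise ValueError("expected exactly three perturbed versions")
    if not 0 <= original_slot < len(LETTERS):
        raise ValueError("original_slot must be between 0 and 3")
    texts = list(perturbations)
    texts.insert(original_slot, original)
    options = dict(zip(LETTERS, texts))
    options["E"] = NONE_OPTION  # extra "none of these" option
    return QuizItem(options=options, answer=LETTERS[original_slot])


def all_positions(original, perturbations):
    """Assumed bias control: one quiz item per possible slot of the original."""
    return [build_quiz_item(original, perturbations, i) for i in range(len(LETTERS))]


def format_prompt(item):
    """Render a quiz item as plain text (hypothetical prompt wording)."""
    lines = ["Which option is the exact instance from the dataset partition?"]
    lines += [f"{letter}) {text}" for letter, text in item.options.items()]
    return "\n".join(lines)


def selection_rate(items, chosen_letters):
    """Fraction of quiz items for which the model picked the original."""
    hits = sum(item.answer == choice for item, choice in zip(items, chosen_letters))
    return hits / len(items)
```

In use, each string returned by `format_prompt` would be sent to the model under test, the chosen letters collected, and `selection_rate` computed over the subsampled instances. How that rate maps to a contamination estimate is left open here, since the abstract does not define it.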