While many studies have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of techniques to assess the degree of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data but fails to reproduce the dataset verbatim. For these datasets, although they were seen during training, good performance on downstream tasks might not be due to overfitting. Our findings underscore the need for ensuring data integrity in machine learning tasks with LLMs. To facilitate future research, we release an open-source tool that can perform various tests for memorization: \url{https://github.com/interpretml/LLM-Tabular-Memorization-Checker}.
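To make the idea of a verbatim memorization test concrete, the following is a minimal sketch of one possible check: prompt a model with a few consecutive rows of a table and see whether it reproduces the next row exactly. It is an illustration under stated assumptions, not the implementation in the released tool. The function name `row_completion_rate`, the `complete` callable (a stand-in for any LLM completion call), the `context_size` parameter, and the toy table are all hypothetical.

```python
from typing import Callable, Sequence


def row_completion_rate(
    rows: Sequence[str],
    complete: Callable[[str], str],
    context_size: int = 3,
) -> float:
    """Fraction of rows reproduced verbatim given the preceding rows.

    `complete` is a hypothetical stand-in for an LLM call: it takes a
    prompt string and returns the model's continuation as a string.
    """
    if len(rows) <= context_size:
        raise ValueError("need more rows than context_size")

    hits = 0
    trials = 0
    for i in range(context_size, len(rows)):
        # Prompt with the previous `context_size` rows, one per line.
        prompt = "\n".join(rows[i - context_size:i]) + "\n"
        completion = complete(prompt)
        # Only the first line of the continuation counts as the predicted row.
        lines = completion.strip().splitlines()
        predicted = lines[0].strip() if lines else ""
        hits += predicted == rows[i].strip()
        trials += 1
    return hits / trials


# Toy table with made-up values, used only to exercise the function.
TOY_TABLE = [
    "1,red,small",
    "2,blue,large",
    "3,green,medium",
    "4,red,large",
    "5,blue,small",
    "6,green,large",
]


def memorizing_model(prompt: str) -> str:
    """Stand-in 'model' that has memorized TOY_TABLE verbatim."""
    text = "\n".join(TOY_TABLE) + "\n"
    start = text.find(prompt)
    return text[start + len(prompt):] if start >= 0 else ""


def silent_model(prompt: str) -> str:
    """Stand-in 'model' that never reproduces anything."""
    return ""


memorized_rate = row_completion_rate(TOY_TABLE, memorizing_model)
unseen_rate = row_completion_rate(TOY_TABLE, silent_model)
```

A high rate on rows with many unique values would suggest verbatim memorization. A low rate does not by itself rule out contamination, since, as the abstract notes, a model may reproduce the statistics of a dataset without reproducing its rows exactly.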