While many works have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. While LLMs are significantly better than random at solving statistical classification problems, the sample efficiency of few-shot learning lags behind traditional statistical learning algorithms, especially as the dimension of the problem increases. This suggests that much of the observed few-shot performance on novel real-world datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We release a Python package (https://github.com/interpretml/LLM-Tabular-Memorization-Checker) to test LLMs for memorization of tabular datasets.
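One of the memorization tests described above can be sketched as a row-completion check: prompt the model with a few consecutive rows of the dataset and ask whether it reproduces the next row verbatim. The sketch below is illustrative only and does not use the released package's actual API; the `complete` callable stands in for an arbitrary LLM completion function, and all names are hypothetical.

```python
from typing import Callable, List


def row_completion_test(rows: List[str], n_context: int,
                        complete: Callable[[str], str]) -> float:
    """Estimate verbatim memorization of a tabular dataset.

    For each row after the first `n_context` rows, prompt the model with
    the preceding `n_context` rows (one CSV row per line) and check
    whether its completion reproduces the held-out row exactly.
    Returns the fraction of verbatim matches in [0, 1].
    """
    hits, trials = 0, 0
    for i in range(n_context, len(rows)):
        # Prompt: the n_context rows immediately before the target row.
        prompt = "\n".join(rows[i - n_context:i]) + "\n"
        lines = complete(prompt).strip().splitlines()
        prediction = lines[0] if lines else ""
        hits += int(prediction == rows[i])
        trials += 1
    return hits / trials if trials else 0.0
```

A score near 1.0 on a public dataset is strong evidence the model saw it during training, since correctly guessing full rows (including noisy numeric fields) without memorization is essentially impossible; scores near 0.0 are consistent with no verbatim memorization.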