While many studies have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training with their performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. Without fine-tuning, we find them to be limited. This suggests that much of the few-shot performance on novel datasets is due to the LLM's world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We make the exposure tests we developed available as the tabmemcheck Python package at https://github.com/interpretml/LLM-Tabular-Memorization-Checker