Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure to test datasets rather than by generalization. In the context of tabular data, however, this problem is largely unexplored. Existing approaches rely primarily on memorization tests, which are too coarse to detect contamination. In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation. Given a dataset, we craft multiple-choice aligned queries that preserve task structure while allowing systematic transformations of the underlying data. These transformations are designed to selectively disrupt dataset information while preserving partial knowledge, enabling us to isolate the performance attributable to contamination. We complement this setup with non-neural baselines that provide reference performance, and we introduce a statistical testing procedure to formally detect significant deviations indicative of contamination. Empirical results on eight widely used tabular datasets reveal clear evidence of contamination in four cases. These findings suggest that performance on downstream tasks involving such datasets may be substantially inflated, raising concerns about the reliability of current evaluation practices.
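The abstract does not spell out the query format, the transformations, or the statistical test, so the following Python sketch is only an illustration of the general idea under stated assumptions. The function names (`make_query`, `shuffle_column`, `one_sided_two_proportion_z`) are hypothetical, column shuffling stands in for whatever transformations the authors use, and the one-sided two-proportion z-test is a substitute chosen for simplicity, not the paper's procedure.

```python
import math
import random


def make_query(row, target, distractors, rng):
    """Build a multiple-choice query (hypothetical format): hide the target
    value and offer it among distractor options in random order."""
    options = list(distractors) + [row[target]]
    rng.shuffle(options)
    features = {k: v for k, v in row.items() if k != target}
    return {
        "features": features,
        "options": options,
        "answer": options.index(row[target]),
    }


def shuffle_column(rows, column, rng):
    """Assumed transformation: permute one column across rows. This breaks
    row-level associations but keeps the column's marginal distribution."""
    values = [r[column] for r in rows]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(rows, values)]


def one_sided_two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Assumed test: is accuracy on condition A (original data) significantly
    higher than on condition B (transformed data)? Returns (z, p_value)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both conditions are all-correct or all-wrong: no evidence either way.
        return 0.0, 0.5
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    rng = random.Random(0)
    rows = [
        {"age": 39, "education": "Bachelors", "income": ">50K"},
        {"age": 23, "education": "HS-grad", "income": "<=50K"},
        {"age": 51, "education": "Masters", "income": ">50K"},
    ]
    print(make_query(rows[0], "education", ["HS-grad", "Masters"], rng))
    print(shuffle_column(rows, "income", rng))
    # Hypothetical counts: 80/100 correct on original rows, 50/100 on transformed.
    print(one_sided_two_proportion_z(80, 100, 50, 100))
```

In this reading, a model that answers far more accurately on original rows than on transformed rows, and beyond what a non-neural baseline achieves, would be a candidate for contamination; the counts in the usage example are made up for illustration.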

