Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure to test datasets rather than by generalization. In the context of tabular data, however, this problem remains largely unexplored. Existing approaches rely primarily on memorization tests, which are too coarse to detect contamination reliably. In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation. Given a dataset, we craft multiple-choice aligned queries that preserve the task structure while allowing systematic transformations of the underlying data. These transformations are designed to selectively disrupt dataset information while preserving partial knowledge, enabling us to isolate the performance attributable to contamination. We complement this setup with non-neural baselines that provide reference performance, and we introduce a statistical testing procedure to formally detect significant deviations indicative of contamination. Empirical results on eight widely used tabular datasets reveal clear evidence of contamination in four cases. These findings suggest that performance on downstream tasks involving such datasets may be substantially inflated, raising concerns about the reliability of current evaluation practices.
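To make the comparative evaluation concrete, the following is a minimal illustrative sketch, not the authors' procedure. The abstract does not specify which statistical test is used, so this example assumes a one-sided exact sign test (McNemar-style) on paired correctness flags: each multiple-choice query is answered once on the original data and once on a transformed version, and the test asks whether the model is correct on the original only more often than chance would explain. The function names `binom_tail` and `paired_contamination_test` are hypothetical.

```python
from math import comb


def binom_tail(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def paired_contamination_test(correct_original, correct_transformed) -> float:
    """One-sided exact sign test on paired correctness flags (assumed test).

    correct_original[i]    -- True if the model answered query i correctly
                              on the untransformed data.
    correct_transformed[i] -- True if it answered the transformed version
                              of the same query correctly.

    Returns a p-value for the null hypothesis that the model is no more
    likely to be right only on the original than only on the transformed
    query. A small p-value is consistent with contamination; it does not
    prove it.
    """
    if len(correct_original) != len(correct_transformed):
        raise ValueError("both conditions must cover the same queries")

    pairs = [(bool(o), bool(t)) for o, t in zip(correct_original, correct_transformed)]
    only_original = sum(o and not t for o, t in pairs)
    only_transformed = sum(t and not o for o, t in pairs)
    discordant = only_original + only_transformed

    if discordant == 0:
        # No query distinguishes the two conditions: no evidence either way.
        return 1.0
    return binom_tail(only_original, discordant, 0.5)
```

In such a sketch, a dataset would be flagged only when the p-value falls below a pre-chosen significance level, ideally after correcting for the number of datasets and transformations tested. The non-neural baselines mentioned in the abstract would serve as a separate reference point and are not modeled here.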