Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models

Large Language Models (LLMs) are increasingly evaluated on their ability to reason over structured data, yet such assessments often overlook a crucial confound: dataset contamination. In this work, we investigate whether LLMs exhibit prior knowledge of widely used tabular benchmarks such as Adult Income and Titanic. Through a series of controlled probing experiments, we reveal that contamination effects emerge exclusively for datasets containing strong semantic cues, for instance meaningful column names or interpretable value categories. In contrast, when such cues are removed or randomized, performance sharply declines to near-random levels. These findings suggest that LLMs' apparent competence on tabular reasoning tasks may, in part, reflect memorization of publicly available datasets rather than genuine generalization. We discuss implications for evaluation protocols and propose strategies to disentangle semantic leakage from authentic reasoning ability in future LLM assessments.
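The abstract does not specify how semantic cues are removed or randomized, so the following is only a minimal sketch of one plausible way to do it, not the authors' procedure. It assumes each table is a list of dictionaries with scalar cells, renames columns to opaque identifiers, and replaces string categories with shuffled opaque codes while leaving numeric values untouched. The function name `strip_semantic_cues` and the `col_i` / `v_i` naming scheme are hypothetical.

```python
import random
from typing import Any


def strip_semantic_cues(
    rows: list[dict[str, Any]], seed: int = 0
) -> tuple[list[dict[str, Any]], dict[str, str], dict[str, dict[str, str]]]:
    """Return a copy of `rows` with opaque column names and category codes.

    Assumption: every row has the same keys and every cell is a scalar.
    Numeric cells are kept as-is; only string cells are treated as categories.
    """
    rng = random.Random(seed)
    columns = list(rows[0].keys())

    # Replace meaningful column names with positional placeholders.
    col_map = {name: f"col_{i}" for i, name in enumerate(columns)}

    # For each column, map every distinct string value to an opaque code.
    # Shuffling means the code index reveals nothing about alphabetical order.
    value_maps: dict[str, dict[str, str]] = {}
    for name in columns:
        categories = sorted({r[name] for r in rows if isinstance(r[name], str)})
        rng.shuffle(categories)
        value_maps[name] = {cat: f"v{i}" for i, cat in enumerate(categories)}

    # Rebuild each row with masked names and masked categorical values.
    masked = [
        {col_map[name]: value_maps[name].get(value, value) for name, value in r.items()}
        for r in rows
    ]
    return masked, col_map, value_maps
```

Comparing a model's accuracy on the original rows against the masked rows gives a rough measure of how much of its performance depends on recognizable names and categories. The returned `col_map` and `value_maps` allow the masking to be inverted when scoring predictions.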

