Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified "in the wild", that is, across different generators, domains, and table formats. This challenge is unique to tabular data, where structure (such as the number of columns, data types, and formats) can vary widely from one table to another. We propose three cross-table baseline detectors and four distinct evaluation protocols, each corresponding to a different level of "wildness". Our very preliminary results confirm that cross-table adaptation is a challenging task.