Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified across different tables. This challenge is specific to tabular data, where structure (such as the number of columns, data types, and formats) can vary widely from one table to another. We propose four table-agnostic detectors combined with simple preprocessing schemes and evaluate them under six protocols with different levels of "wildness". Our results show that cross-table learning on a restricted set of tables is possible even with naive preprocessing schemes. They confirm, however, that cross-table transfer (i.e., deployment on a table that has not been seen before) is challenging. This suggests that more sophisticated encoding schemes are required to handle this problem.