Dependencies among attributes are a common aspect of tabular data. However, whether existing tabular data generation algorithms preserve these dependencies when generating synthetic data has yet to be explored. In this article, we introduce the notion of logical dependencies among attributes, complementing the existing notion of functional dependencies. Moreover, we provide a measure to quantify logical dependencies among attributes in tabular data. Using this measure, we compare several state-of-the-art synthetic data generation algorithms and test their ability to preserve logical and functional dependencies on several publicly available datasets. We demonstrate that currently available synthetic tabular data generation algorithms do not fully preserve functional dependencies in the synthetic datasets they generate. We also show that some synthetic tabular data generation models can preserve inter-attribute logical dependencies. Our review and comparison of the state of the art reveal research needs and opportunities for developing task-specific synthetic tabular data generation models.