In an era of rapidly advancing data-driven applications, there is a growing demand for data in both research and practice. Synthetic data has emerged as an alternative when no real data is available (e.g., due to privacy regulations). Synthesizing tabular data presents unique and complex challenges, especially handling (i) missing values, (ii) dataset imbalance, (iii) diverse column types, and (iv) complex data distributions, as well as preserving (i) column correlations, (ii) temporal dependencies, and (iii) integrity constraints (e.g., functional dependencies) present in the original dataset. While substantial progress has been made recently in generative models, there is no one-size-fits-all solution for tabular data today, and choosing the right tool for a given task is therefore nontrivial. In this paper, we survey the state of the art in Tabular Data Synthesis (TDS), examine the needs of users by defining a set of functional and non-functional requirements, and compile the challenges associated with meeting those needs. In addition, we evaluate the reported performance of 36 popular research TDS tools against these requirements and develop a decision guide to help users find suitable TDS tools for their applications. The resulting decision guide also identifies significant research gaps.