Tabular data is common yet typically incomplete, small in volume, and access-restricted due to privacy concerns. Synthetic data generation offers potential solutions. Many metrics exist for evaluating the quality of synthetic tabular data; however, we lack an objective, coherent interpretation of them. To address this issue, we propose an evaluation framework with a single mathematical objective: the synthetic data should be drawn from the same distribution as the observed data. Through various structural decompositions of the objective, this framework allows us to reason, for the first time, about the completeness of any set of metrics, and it unifies existing metrics, including those that stem from fidelity considerations, downstream applications, and model-based approaches. Moreover, the framework motivates model-free baselines and a new spectrum of metrics. We evaluate structurally informed synthesizers and synthesizers powered by deep learning. Using our structured framework, we show that synthetic data generators that explicitly represent tabular structure outperform other methods, especially on smaller datasets.
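As a rough illustration of the stated objective (synthetic data drawn from the same distribution as the observed data), the sketch below compares a single categorical column of an observed and a synthetic table using the total variation distance between their empirical marginals. This is only one simple, model-free check chosen for illustration; the function name `marginal_tv_distance` and the toy columns are assumptions and are not taken from the framework described above.

```python
from collections import Counter


def marginal_tv_distance(real_column, synthetic_column):
    """Total variation distance between the empirical distributions
    of two categorical columns (0 = identical, 1 = disjoint support).

    Hypothetical helper for illustration; not the paper's metric.
    """
    if not real_column or not synthetic_column:
        raise ValueError("both columns must be non-empty")

    real_counts = Counter(real_column)
    synth_counts = Counter(synthetic_column)
    n_real = len(real_column)
    n_synth = len(synthetic_column)

    # Compare relative frequencies over every category seen in either column.
    # Counter returns 0 for categories it has not seen.
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Toy columns (made up for this example).
real = ["a", "a", "b", "c"]
good_synthetic = ["a", "b", "a", "c"]   # same category frequencies as `real`
poor_synthetic = ["c", "c", "c", "c"]   # collapses onto one category

print(marginal_tv_distance(real, good_synthetic))
print(marginal_tv_distance(real, poor_synthetic))
```

A distance of zero on one marginal is necessary but not sufficient for the two tables to share a joint distribution, which is why reasoning about the completeness of a set of metrics matters.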