Realistic and useful synthetic data has significant potential. However, current evaluation methods for synthetic tabular data generation focus predominantly on downstream task utility and often neglect statistical properties. This oversight is particularly consequential in low-sample scenarios, where these statistical measures deteriorate quickly. In this paper, we address this issue by evaluating three state-of-the-art synthetic tabular data generators on marginal distribution, column-pair correlation, joint distribution, and downstream task utility across high- to low-sample regimes. The popular CTGAN model shows strong utility but underperforms in low-sample settings. To overcome this limitation, we propose MargCTGAN, which adds feature matching of de-correlated marginals and yields a consistent improvement in both the downstream utility and the statistical properties of the synthetic data.
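To make two of the evaluation axes named above concrete, the following is a minimal sketch of a marginal-distribution score and a column-pair correlation score for purely numeric tables. The function names `marginal_gap` and `correlation_gap`, the histogram-based total-variation distance, and the use of Pearson correlation are illustrative assumptions and are not necessarily the metrics used in the paper.

```python
import numpy as np


def marginal_gap(real, synth, bins=10):
    """Mean total-variation distance between per-column histograms.

    Assumption: both inputs are 2-D numeric arrays with the same columns.
    """
    gaps = []
    for j in range(real.shape[1]):
        # Shared bin edges so the two histograms are directly comparable.
        lo = min(real[:, j].min(), synth[:, j].min())
        hi = max(real[:, j].max(), synth[:, j].max())
        edges = np.linspace(lo, hi, bins + 1)
        p, _ = np.histogram(real[:, j], bins=edges)
        q, _ = np.histogram(synth[:, j], bins=edges)
        p = p / p.sum()
        q = q / q.sum()
        gaps.append(0.5 * np.abs(p - q).sum())
    return float(np.mean(gaps))


def correlation_gap(real, synth):
    """Mean absolute difference between off-diagonal Pearson correlations."""
    c_real = np.corrcoef(real, rowvar=False)
    c_synth = np.corrcoef(synth, rowvar=False)
    off_diagonal = ~np.eye(real.shape[1], dtype=bool)
    return float(np.abs(c_real - c_synth)[off_diagonal].mean())


# Toy data (hypothetical): two strongly correlated Gaussian columns.
rng = np.random.default_rng(0)
real = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000)

# Shuffling each column independently keeps every marginal intact
# but destroys the dependence between the columns.
shuffled = np.column_stack(
    [rng.permutation(real[:, 0]), rng.permutation(real[:, 1])]
)

print("marginal gap:", marginal_gap(real, shuffled))
print("correlation gap:", correlation_gap(real, shuffled))
```

The shuffled table illustrates why the axes are reported separately: its marginals match the real data, yet its column-pair correlations do not, so a generator can score well on one measure while failing the other.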