Generating synthetic data with generative models is gaining interest in the ML community and beyond, promising a future where datasets can be tailored to individual needs. Unfortunately, synthetic data is usually imperfect, which can introduce errors in downstream tasks. In this work we explore how the generative process affects the downstream ML task. We show that the naive synthetic data approach (using synthetic data as if it were real) leads to downstream models and analyses that do not generalize well to real data. As a first step towards better ML in the synthetic data regime, we introduce Deep Generative Ensemble (DGE), a framework inspired by Deep Ensembles that aims to implicitly approximate the posterior distribution over the parameters of the generative process model. DGE improves downstream model training, evaluation, and uncertainty quantification, vastly outperforming the naive approach on average. The largest improvements are achieved for minority classes and low-density regions of the original data, for which the generative uncertainty is largest.
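The ensemble idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a one-dimensional, two-class problem, uses a class-conditional Gaussian as a stand-in for a deep generative model, and uses bootstrap resampling as a stand-in for the random initializations that give Deep Ensembles their diversity. All function names (`make_toy_data`, `fit_gaussians`, `sample`, `predict_proba`, `dge_predict`) are hypothetical. Each of the `k` generators produces its own synthetic dataset, a downstream classifier is fit on each, and the predictions are averaged; the spread across members serves as a rough proxy for generative uncertainty.

```python
import math
import random
import statistics


def make_toy_data(seed=0, n_major=200, n_minor=20):
    """Toy 'real' data: an imbalanced 1-D two-class problem (illustrative only)."""
    rng = random.Random(seed)
    major = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_major)]
    minor = [(rng.gauss(3.0, 1.0), 1) for _ in range(n_minor)]
    return major + minor


def fit_gaussians(data):
    """Fit (mean, std, prior) per class; used both as generator and as classifier."""
    params = {}
    for label in (0, 1):
        xs = [x for x, y in data if y == label]
        params[label] = (statistics.fmean(xs), statistics.stdev(xs), len(xs) / len(data))
    return params


def sample(params, n, rng):
    """Draw n synthetic (x, y) pairs from a fitted class-conditional Gaussian."""
    out = []
    for _ in range(n):
        label = 1 if rng.random() < params[1][2] else 0
        mu, sigma, _ = params[label]
        out.append((rng.gauss(mu, sigma), label))
    return out


def predict_proba(params, x):
    """P(y = 1 | x) under the fitted class-conditional Gaussians."""
    def weighted_pdf(label):
        mu, sigma, prior = params[label]
        return prior * math.exp(-0.5 * ((x - mu) / sigma) ** 2) / sigma

    p0, p1 = weighted_pdf(0), weighted_pdf(1)
    return p1 / (p0 + p1)


def dge_predict(real, x, k=10, seed=0):
    """Average downstream predictions over k independently fitted generators.

    Returns (mean probability, std across ensemble members).
    """
    rng = random.Random(seed)
    probs = []
    for _ in range(k):
        # Assumption: a bootstrap resample stands in for a random initialization.
        generator = fit_gaussians([rng.choice(real) for _ in real])
        synthetic = sample(generator, len(real), rng)
        downstream = fit_gaussians(synthetic)  # trained on synthetic data only
        probs.append(predict_proba(downstream, x))
    return statistics.fmean(probs), statistics.pstdev(probs)


if __name__ == "__main__":
    real = make_toy_data()
    for x in (0.0, 2.0, 4.0):
        mean, spread = dge_predict(real, x)
        print(f"x={x:.1f}  P(y=1|x)={mean:.3f}  ensemble std={spread:.3f}")
```

In this toy setting, calling `dge_predict` with `k=1` corresponds to the naive approach: a single synthetic dataset is treated as if it were real, and no ensemble spread is available to signal where the generator is unreliable.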