Synthetic tabular data generation with Deep Generative Models (DGMs) offers a compelling solution to data scarcity and privacy concerns, but DGMs need substantial training data to be effective, and such data is often unavailable in real-world applications. This paper addresses this challenge by proposing a novel methodology for generating realistic and reliable synthetic tabular data with DGMs in limited real-data environments. Our approach introduces an artificial inductive bias into a DGM through transfer learning and meta-learning techniques. We explore and compare four different methods within this framework, demonstrating that transfer learning strategies such as pre-training and model averaging outperform meta-learning approaches such as Model-Agnostic Meta-Learning and Domain Randomized Search. We validate our approach using two state-of-the-art DGMs, namely a Variational Autoencoder and a Generative Adversarial Network, and show that the artificial inductive bias improves synthetic data quality, as measured by Jensen-Shannon divergence, with relative gains of up to 50%. This methodology is broadly applicable to various DGMs and machine learning tasks, particularly in areas such as healthcare and finance, where data scarcity is often a critical issue.
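As an illustration of the quality metric named in the abstract, the following is a minimal sketch of a Jensen-Shannon divergence between one real and one synthetic numeric column, together with a relative-gain calculation. The abstract does not specify the evaluation protocol, so the per-column histogram comparison, the function names `js_divergence`, `column_js`, and `relative_gain`, the bin count, and the definition of relative gain as a fractional reduction in divergence are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between two
    non-negative count or probability vectors of equal length."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute 0; b > 0 wherever a > 0 because b is the mixture.
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def column_js(real_col, synth_col, bins=20):
    """Assumed protocol: histogram both columns on shared bin edges,
    then compare the two histograms with js_divergence."""
    real_col = np.asarray(real_col, dtype=float)
    synth_col = np.asarray(synth_col, dtype=float)
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    return js_divergence(p, q)


def relative_gain(js_baseline, js_improved):
    """Assumed definition: fractional reduction in divergence versus a baseline."""
    return (js_baseline - js_improved) / js_baseline


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(0.0, 1.0, size=500)
    synth_baseline = rng.normal(3.0, 1.0, size=500)  # poorly matched generator
    synth_improved = rng.normal(0.0, 1.0, size=500)  # well matched generator
    base = column_js(real, synth_baseline)
    better = column_js(real, synth_improved)
    print(f"baseline JS={base:.3f}, improved JS={better:.3f}, "
          f"relative gain={relative_gain(base, better):.1%}")
```

Lower divergence means the synthetic column is closer to the real one, so under this assumed definition a relative gain of 0.5 corresponds to halving the divergence of the baseline model.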