Recent years have witnessed a surge in the popularity of Machine Learning (ML), which is now applied across diverse domains. However, progress is impeded by the scarcity of training data, caused by expensive acquisition and privacy legislation. Synthetic data has emerged as a solution, but the abundance of released models and the limited overview literature pose challenges for decision-making. This work surveys 417 Synthetic Data Generation (SDG) models from the last decade, providing a comprehensive overview of model types, functionality, and improvements. Common attributes are identified, leading to a classification and trend analysis. The findings reveal increasing model performance and complexity, with neural network-based approaches prevailing, except in privacy-preserving data generation. Computer vision dominates, with GANs as the primary generative models, while diffusion models, transformers, and RNNs compete. Our performance evaluation highlights the scarcity of common metrics and datasets, which makes comparisons challenging. Additionally, training and computational costs are neglected in the literature and need attention in future research. This work serves as a guide for SDG model selection and identifies crucial areas for future exploration.