Data plays a crucial role in machine learning. In real-world applications, however, data often poses several problems: it may be of low quality; a limited number of data points leads to under-fitting of the machine learning model; and access may be restricted by privacy, safety, and regulatory concerns. Synthetic data generation offers a promising new avenue, since synthetic data can be shared and used in ways that real-world data cannot. This paper systematically reviews existing work that leverages machine learning models for synthetic data generation. Specifically, we discuss this work from three perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; and (iii) privacy and fairness issues. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.