Machine Learning for Synthetic Data Generation: A Review

Data plays a crucial role in machine learning. However, in real-world applications, there are several problems with data, e.g., data are of low quality; a limited number of data points lead to under-fitting of the machine learning model; it is hard to access the data due to privacy, safety and regulatory concerns. Synthetic data generation offers a promising new avenue, as it can be shared and used in ways that real-world data cannot. This paper systematically reviews the existing works that leverage machine learning models for synthetic data generation. Specifically, we discuss the synthetic data generation works from several perspectives: (i) applications, including computer vision, speech, natural language, healthcare, and business; (ii) machine learning methods, particularly neural network architectures and deep generative models; (iii) privacy and fairness issue. In addition, we identify the challenges and opportunities in this emerging field and suggest future research directions.

翻译：数据在机器学习中扮演着关键角色。然而，在实际应用中，数据存在若干问题，例如：数据质量低、数据点数量有限导致机器学习模型欠拟合、以及因隐私、安全和监管问题难以获取数据。合成数据生成提供了一种前景广阔的新途径，它能够在真实数据无法实现的方式下进行共享和使用。本文系统性地审视了现有利用机器学习模型进行合成数据生成的研究工作。具体而言，我们从以下几个角度讨论了合成数据生成工作：（i）应用领域，包括计算机视觉、语音、自然语言、医疗健康和商业；（ii）机器学习方法，特别是神经网络架构和深度生成模型；（iii）隐私与公平性问题。此外，我们指出了这一新兴领域面临的挑战与机遇，并提出了未来研究方向。

相关内容

Machine Learning

关注 2251

机器学习（Machine Learning）是一个研究计算学习方法的国际论坛。该杂志发表文章，报告广泛的学习方法应用于各种学习问题的实质性结果。该杂志的特色论文描述研究的问题和方法，应用研究和研究方法的问题。有关学习问题或方法的论文通过实证研究、理论分析或与心理现象的比较提供了坚实的支持。应用论文展示了如何应用学习方法来解决重要的应用问题。研究方法论文改进了机器学习的研究方法。所有的论文都以其他研究人员可以验证或复制的方式描述了支持证据。论文还详细说明了学习的组成部分，并讨论了关于知识表示和性能任务的假设。官网地址：http://dblp.uni-trier.de/db/journals/ml/

【图机器学习进展与趋势@ICML2022】Graph Machine Learning @ ICML 2022

专知会员服务

40+阅读 · 2022年7月25日

【干货书】深度学习合成数据，354页pdf，Synthetic Data for Deep Learning

专知会员服务

105+阅读 · 2022年2月10日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

专知会员服务

39+阅读 · 2020年11月3日

【干货书】真实机器学习，264页pdf，Real-World Machine Learning