Generating synthetic data with generative models is attracting growing interest in the ML community and beyond. In the past, synthetic data was often regarded mainly as a means to privacy-preserving data release, but a surge of recent papers explores how its potential reaches much further -- from creating fairer data to data augmentation, and from simulation to text generated by ChatGPT. In this perspective, we explore whether, and how, synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. Just as importantly, we discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data -- the most important of which is quantifying how much we can trust any finding or prediction drawn from synthetic data.