Despite recent advances in synthetic data generation, the scientific community has not reached a consensus on its usefulness. Synthetic data is commonly believed to serve two purposes: data exchange and boosting machine learning (ML) training. Privacy-preserving synthetic data generation can accelerate data exchange for downstream tasks, but there is not enough evidence to show how or why synthetic data can boost ML training. In this study, we benchmarked ML performance using synthetic tabular data for four use cases: data sharing, data augmentation, class balancing, and data summarization. We observed marginal improvements for the class balancing use case on some datasets. However, we conclude that there is not enough evidence to claim that synthetic tabular data is useful for ML training.