Synthetic training data has gained prominence in numerous learning tasks and scenarios, offering advantages such as dataset augmentation, generalization evaluation, and privacy preservation. Despite these benefits, the efficiency of synthetic data generated by current methodologies remains inferior when such data is used exclusively to train advanced deep models, which limits its practical utility. To address this challenge, we analyze the principles underlying training data synthesis for supervised learning and elucidate a principled theoretical framework, from the distribution-matching perspective, that explicates the mechanisms governing synthesis efficacy. Through extensive experiments, we demonstrate the effectiveness of our synthetic data across diverse image classification tasks, both as a replacement for and as an augmentation to real datasets, and show that it also benefits challenging tasks such as out-of-distribution generalization and privacy preservation.