Acquiring and annotating suitable datasets for training deep learning models is challenging. The process is often tedious and time-consuming, which can hinder research progress. Generative models have emerged as a promising way to produce synthetic datasets that can replace or augment real-world data. However, the effectiveness of synthetic data is limited by its inability to fully capture the complexity and diversity of real-world data. To address this issue, we explore the use of Generative Adversarial Networks (GANs) to generate synthetic datasets for training classifiers that are subsequently evaluated on real-world images. To improve the quality and diversity of the synthetic dataset, we propose three novel post-processing techniques: Dynamic Sample Filtering, Dynamic Dataset Recycle, and Expansion Trick. In addition, we introduce a pipeline called Gap Filler (GaFi), which applies these techniques in an optimal and coordinated manner to maximise classification accuracy on real-world data. Our experiments show that GaFi narrows the gap to real-data accuracy to 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively. These results set a new state of the art in Classification Accuracy Score and highlight the effectiveness of post-processing techniques in improving the quality of synthetic datasets.
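The evaluation protocol described in the abstract (train a classifier on synthetic data, test it on real data, and compare against a classifier trained on real data) can be illustrated with a minimal sketch. This is not the GaFi pipeline and does not implement any of the three post-processing techniques, whose details the abstract does not give. The nearest-centroid classifier is a stand-in chosen only to keep the example short, and the names `fit_centroids`, `accuracy`, and `cas_gap` are hypothetical.

```python
import numpy as np


def fit_centroids(x, y, num_classes):
    """Fit a nearest-centroid classifier: one mean vector per class.

    Stand-in for the deep classifier used in practice (assumption).
    """
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])


def accuracy(centroids, x, y):
    """Fraction of samples whose nearest centroid matches the label."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return float((dists.argmin(axis=1) == y).mean())


def cas_gap(real_train, synth_train, real_test, num_classes):
    """Return (real accuracy, synthetic-trained accuracy, gap).

    Both classifiers are evaluated on the same real test set; the gap
    is the accuracy lost by training on synthetic instead of real data.
    """
    real_acc = accuracy(fit_centroids(*real_train, num_classes), *real_test)
    cas = accuracy(fit_centroids(*synth_train, num_classes), *real_test)
    return real_acc, cas, real_acc - cas


# Toy 2-D data, made up purely for illustration.
real_train = (np.array([[0., 0.], [0., 1.], [4., 4.], [4., 5.]]),
              np.array([0, 0, 1, 1]))
# "Synthetic" data whose class 0 is deliberately shifted off-distribution.
synth_train = (np.array([[0., 0.], [-4., -4.], [2., 2.], [4., 4.]]),
               np.array([0, 0, 1, 1]))
real_test = (np.array([[0., 0.5], [2., 2.], [4., 4.5], [1., 1.]]),
             np.array([0, 0, 1, 0]))

real_acc, cas, gap = cas_gap(real_train, synth_train, real_test, num_classes=2)
print(real_acc, cas, gap)  # prints: 1.0 0.5 0.5
```

In this toy setting the classifier trained on the shifted synthetic set misclassifies two of the four real test points, so the gap is 0.5. The percentages reported in the abstract are gaps of this kind, measured with real classifiers on Fashion-MNIST, CIFAR-10, and CIFAR-100.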