Acquiring and annotating suitable datasets for training deep learning models is challenging, and the tedious, time-consuming effort involved can hinder research progress. Generative models have emerged as a promising way to produce synthetic datasets that can replace or augment real-world data. However, the effectiveness of synthetic data is limited by its inability to fully capture the complexity and diversity of real-world data. To address this issue, we explore the use of Generative Adversarial Networks to generate synthetic datasets for training classifiers that are subsequently evaluated on real-world images. To improve the quality and diversity of the synthetic dataset, we propose three novel post-processing techniques: Dynamic Sample Filtering, Dynamic Dataset Recycle, and Expansion Trick. In addition, we introduce a pipeline called Gap Filler (GaFi), which applies these techniques in an optimal and coordinated manner to maximise classification accuracy on real-world data. Our experiments show that GaFi narrows the gap to the accuracy obtained with real training data to 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively. These results represent a new state of the art in Classification Accuracy Score and highlight the effectiveness of post-processing techniques in improving the quality of synthetic datasets.
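The evaluation protocol described above (train a classifier on synthetic data, test it on real data, and compare against a classifier trained on real data) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the nearest-centroid classifier, the toy 2-D points, and the helper names `fit_centroids`, `predict`, `accuracy`, and `cas_gap` are hypothetical and are not the classifiers, datasets, or post-processing techniques used in GaFi.

```python
from collections import defaultdict


def fit_centroids(samples, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for i, value in enumerate(x):
            sums[y][i] += value
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sq_dist)


def accuracy(centroids, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(predict(centroids, x) == y for x, y in zip(samples, labels))
    return correct / len(labels)


def cas_gap(real_train, synth_train, real_test):
    """Return (real accuracy, accuracy when trained on synthetic data, gap).

    Each argument is a (samples, labels) pair. Both classifiers are
    evaluated on the same real test set.
    """
    real_acc = accuracy(fit_centroids(*real_train), *real_test)
    cas = accuracy(fit_centroids(*synth_train), *real_test)
    return real_acc, cas, real_acc - cas


# Toy data (assumed): the synthetic set is shifted relative to the real one,
# mimicking a generator that does not fully match the real distribution.
labels = [0, 0, 0, 1, 1, 1]
real_train = ([(0, 0), (1, 0), (0, 1), (4, 4), (5, 4), (4, 5)], labels)
synth_train = ([(2, 2), (3, 2), (2, 3), (6, 6), (7, 6), (6, 7)], labels)
real_test = ([(0.5, 0.5), (1, 1), (4.5, 4.5), (4, 4)], [0, 0, 1, 1])

real_acc, cas, gap = cas_gap(real_train, synth_train, real_test)
print(real_acc, cas, gap)  # 1.0 0.75 0.25
```

In this toy setting the classifier trained on the shifted synthetic points misclassifies one of the four real test points, so the gap is 0.25. The percentages reported in the abstract are gaps of this kind, measured on Fashion-MNIST, CIFAR-10, and CIFAR-100.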