Bridging the Gap: Enhancing the Utility of Synthetic Data via Post-Processing Techniques

Acquiring and annotating suitable datasets for training deep learning models is challenging, and the tedious, time-consuming effort involved can hinder research progress. Generative models have emerged as a promising way to produce synthetic datasets that can replace or augment real-world data. However, the effectiveness of synthetic data is limited by its inability to fully capture the complexity and diversity of real-world data. To address this issue, we explore the use of Generative Adversarial Networks to generate synthetic datasets for training classifiers that are subsequently evaluated on real-world images. To improve the quality and diversity of the synthetic dataset, we propose three novel post-processing techniques: Dynamic Sample Filtering, Dynamic Dataset Recycle, and Expansion Trick. In addition, we introduce a pipeline called Gap Filler (GaFi), which applies these techniques in an optimal and coordinated manner to maximise classification accuracy on real-world data. Our experiments show that GaFi narrows the gap with real-accuracy scores to an error of 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively. These results represent a new state of the art in Classification Accuracy Score and highlight the effectiveness of post-processing techniques in improving the quality of synthetic datasets.
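The evaluation protocol described in the abstract (train a classifier on synthetic data, then test it on real images) can be illustrated with a minimal sketch. This is not the GaFi pipeline. The nearest-centroid classifier is a toy stand-in for a deep network, the function names `fit_centroids`, `predict`, `accuracy`, and `accuracy_gap` are hypothetical, and reading the "gap" as real accuracy minus Classification Accuracy Score is an assumption based on the abstract's wording.

```python
import numpy as np

def fit_centroids(x, y):
    """Toy classifier (assumption): one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([x[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, x):
    classes, centroids = model
    # Squared Euclidean distance from every sample to every class centroid.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

def accuracy(model, x, y):
    return float((predict(model, x) == y).mean())

def accuracy_gap(real_train, synth_train, real_test):
    """Return (real accuracy, CAS, gap), all measured on the same real test set.

    Each argument is an (x, y) pair of NumPy arrays. CAS here means the
    accuracy of a classifier trained only on synthetic data.
    """
    real_acc = accuracy(fit_centroids(*real_train), *real_test)
    cas = accuracy(fit_centroids(*synth_train), *real_test)
    return real_acc, cas, real_acc - cas
```

With a real generator, `synth_train` would hold generated images and their conditioning labels. A smaller gap indicates synthetic data that is closer in utility to the real training set.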

