Bridging the Gap: Enhancing the Utility of Synthetic Data via Post-Processing Techniques

Acquiring and annotating suitable datasets for training deep learning models is challenging, and the tedious, time-consuming effort involved can hinder research progress. Generative models have emerged as a promising way to produce synthetic datasets that can replace or augment real-world data. However, the usefulness of synthetic data is limited by its inability to fully capture the complexity and diversity of real-world data. To address this issue, we explore the use of Generative Adversarial Networks to generate synthetic datasets for training classifiers that are subsequently evaluated on real-world images. To improve the quality and diversity of the synthetic dataset, we propose three novel post-processing techniques: Dynamic Sample Filtering, Dynamic Dataset Recycle, and Expansion Trick. In addition, we introduce a pipeline called Gap Filler (GaFi), which applies these techniques in an optimal and coordinated manner to maximise classification accuracy on real-world data. Our experiments show that GaFi narrows the gap to real-accuracy scores to 2.03%, 1.78%, and 3.99% on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets, respectively. These results represent a new state of the art in Classification Accuracy Score and highlight the effectiveness of post-processing techniques in improving the quality of synthetic datasets.
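
The evaluation protocol described in the abstract (train on synthetic data, test on real images, then compare against a classifier trained on real data) can be illustrated with a minimal sketch. It assumes the reported gap is the percentage-point difference between the two test accuracies; that reading, the function names `accuracy` and `cas_gap`, and the toy predictions are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("need at least one label")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def cas_gap(
    real_trained_preds: Sequence[int],
    synthetic_trained_preds: Sequence[int],
    real_test_labels: Sequence[int],
) -> float:
    """Gap in percentage points between two classifiers on the same real test set.

    Assumption: the gap is (accuracy of the real-trained classifier) minus
    (accuracy of the synthetic-trained classifier, i.e. the Classification
    Accuracy Score).
    """
    real_acc = accuracy(real_trained_preds, real_test_labels)
    cas = accuracy(synthetic_trained_preds, real_test_labels)
    return real_acc - cas


# Toy data, invented purely for illustration.
labels = [0, 1, 2, 1, 0, 2, 1, 0]
preds_real = [0, 1, 2, 1, 0, 2, 1, 1]   # classifier trained on real data
preds_synth = [0, 1, 2, 1, 0, 0, 1, 1]  # classifier trained on synthetic data

gap = cas_gap(preds_real, preds_synth, labels)
print(f"gap: {gap:.2f} percentage points")
```

Under this reading, a smaller gap means that training on the synthetic dataset is closer to a drop-in replacement for training on the real one.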

