This paper investigates methods for improving generative data augmentation for deep learning. Generative data augmentation uses synthetic samples produced by generative models as an additional dataset for classification in small-dataset settings. A key challenge of generative data augmentation is that the synthetic data contain uninformative samples that degrade accuracy. This is because the synthetic samples do not perfectly represent the class categories in real data, and uniform sampling does not necessarily provide samples that are useful for the task. In this paper, we present a novel strategy for generative data augmentation called meta generative regularization (MGR). To avoid the degradation caused by generative data augmentation, MGR uses synthetic samples in a regularization term for feature extractors instead of in the loss function, e.g., cross-entropy. These synthetic samples are dynamically determined through meta-learning so as to minimize the validation loss. We observed that MGR avoids the performance degradation of naïve generative data augmentation and improves on the baselines. Experiments on six datasets showed that MGR is particularly effective when datasets are smaller and that it consistently outperforms the baselines.