Generative data augmentation, which scales datasets by obtaining fake labeled examples from a trained conditional generative model, boosts classification performance in various learning tasks including (semi-)supervised learning, few-shot learning, and adversarially robust learning. However, little work has theoretically investigated the effect of generative data augmentation. To fill this gap, we establish a general stability bound in this non-independent and identically distributed (non-i.i.d.) setting, where the learned distribution depends on the original training set and generally differs from the true distribution. Our theoretical result includes the divergence between the learned distribution and the true distribution. It shows that generative data augmentation can enjoy a faster learning rate when the order of the divergence term is $o\left(\max\left(\log(m)\beta_m, 1/\sqrt{m}\right)\right)$, where $m$ is the training set size and $\beta_m$ is the corresponding stability constant. We further specialize the learning setup to the Gaussian mixture model and generative adversarial nets. We prove that in both cases, though generative data augmentation does not enjoy a faster learning rate, it can improve the learning guarantees at a constant level when the training set is small, which is significant when severe overfitting occurs. Simulation results on the Gaussian mixture model and empirical results on generative adversarial nets support our theoretical conclusions. Our code is available at https://github.com/ML-GSAI/Understanding-GDA.
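To make the setting concrete, the following is a minimal sketch of generative data augmentation on a binary Gaussian mixture. It is not taken from the authors' repository. It assumes a symmetric mixture with labels in {-1, +1} and a shared isotropic variance, a plug-in generator that estimates the class mean and noise scale from the training set, and a simple mean-based linear classifier. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_bgmm(n, mu, sigma, rng):
    """Draw n labeled points: y uniform on {-1, +1}, x | y ~ N(y * mu, sigma^2 I)."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + sigma * rng.standard_normal((n, mu.shape[0]))
    return x, y

def fit_generator(x, y):
    """Plug-in conditional generator (assumption): estimate mu and sigma from data."""
    mu_hat = (y[:, None] * x).mean(axis=0)
    resid = x - y[:, None] * mu_hat
    sigma_hat = float(np.sqrt((resid ** 2).mean()))
    return mu_hat, sigma_hat

def augment(x, y, n_fake, rng):
    """Append n_fake samples drawn from the learned distribution to the real data."""
    mu_hat, sigma_hat = fit_generator(x, y)
    x_fake, y_fake = sample_bgmm(n_fake, mu_hat, sigma_hat, rng)
    return np.concatenate([x, x_fake]), np.concatenate([y, y_fake])

def fit_linear_classifier(x, y):
    """Mean-based linear classifier (assumption): theta = average of y_i * x_i."""
    return (y[:, None] * x).mean(axis=0)

def accuracy(theta, x, y):
    """Fraction of points whose predicted sign matches the label."""
    return float(np.mean(np.sign(x @ theta) == y))

rng = np.random.default_rng(0)
d, m, m_fake = 50, 20, 200          # dimension, real train size, fake sample count
mu, sigma = np.full(d, 0.3), 1.0    # illustrative true parameters

x_train, y_train = sample_bgmm(m, mu, sigma, rng)
x_test, y_test = sample_bgmm(5000, mu, sigma, rng)
x_aug, y_aug = augment(x_train, y_train, m_fake, rng)

acc_base = accuracy(fit_linear_classifier(x_train, y_train), x_test, y_test)
acc_aug = accuracy(fit_linear_classifier(x_aug, y_aug), x_test, y_test)
print(f"real only: {acc_base:.3f}  real + generated: {acc_aug:.3f}")
```

Because the generator is fit on the same training set that it augments, the fake samples are neither independent of the real ones nor drawn from the true distribution, which is the non-i.i.d. situation the stability bound addresses. The sketch only illustrates the pipeline; it makes no claim about which of the two accuracies is higher.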