Leveraging Contaminated Datasets to Learn Clean-Data Distribution with Purified Generative Adversarial Networks

Generative adversarial networks (GANs) are known for their strong ability to capture the underlying distribution of training instances. Since the seminal GAN work, many variants have been proposed. However, almost all existing GANs assume that the training dataset is clean. In many real-world applications this assumption may not hold: the training dataset may be contaminated by a proportion of undesired instances. When trained on such datasets, existing GANs learn a mixture distribution of desired and contaminated instances rather than the distribution of the desired data alone (the target distribution). To learn the target distribution from contaminated datasets, two purified generative adversarial networks (PuriGAN) are developed, in which the discriminators are augmented with the capability to distinguish between target and contaminated instances by leveraging an extra dataset composed solely of contamination instances. We prove that, under some mild conditions, the proposed PuriGANs are guaranteed to converge to the distribution of desired instances. Experimental results on several datasets demonstrate that, when trained on contaminated datasets, the proposed PuriGANs generate much better images from the desired distribution than comparable baselines. We also demonstrate the usefulness of PuriGAN in downstream applications by applying it to semi-supervised anomaly detection on contaminated datasets and to PU-learning. Experimental results show that PuriGAN delivers the best performance among comparable baselines on both tasks.
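The data setting described above can be illustrated with a small sketch. It builds a contaminated training set as a mixture of a target source and a contamination source, plus a separate dataset made only of contamination instances. The 1-D Gaussians, the contamination proportion `gamma`, and both function names are assumptions made for illustration; none of them come from the paper, and the sketch shows only the inputs a method like PuriGAN would receive, not its objective or training procedure.

```python
import random


def make_contaminated_dataset(n, gamma, seed=0):
    """Draw n labeled 1-D samples; each is contamination with probability gamma.

    Hypothetical toy sources: target ~ N(0, 1), contamination ~ N(4, 0.5).
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        if rng.random() < gamma:
            samples.append((rng.gauss(4.0, 0.5), "contamination"))
        else:
            samples.append((rng.gauss(0.0, 1.0), "target"))
    return samples


def make_contamination_only_dataset(n, seed=1):
    """Extra dataset composed solely of contamination instances."""
    rng = random.Random(seed)
    return [rng.gauss(4.0, 0.5) for _ in range(n)]


# Contaminated training set: a standard GAN fitted to this data would
# model the whole mixture, not only the target component.
data = make_contaminated_dataset(10_000, gamma=0.2)
observed_fraction = sum(1 for _, src in data if src == "contamination") / len(data)

# Extra contamination-only set, available alongside the training data.
extra = make_contamination_only_dataset(1_000)

print(f"contaminated fraction in training set: {observed_fraction:.3f}")
print(f"size of contamination-only set: {len(extra)}")
```

In a real setting the source labels in `data` would not be observed; they are kept here only so the contamination fraction can be checked.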

