We tackle the problem of feature unlearning from pre-trained image generative models, specifically GANs and VAEs. Unlike the common unlearning task, in which the unlearning target is a subset of the training set, we aim to unlearn a specific feature, such as hairstyle in facial images, from a pre-trained generative model. Because the target feature is present only in a local region of an image, unlearning the entire image from the pre-trained model may discard other details in the remaining regions of the image. To specify which feature to unlearn, we collect randomly generated images that contain the target feature. We then identify a latent representation corresponding to the target feature and use that representation to fine-tune the pre-trained model. Through experiments on the MNIST and CelebA datasets, we show that target features are successfully removed while the fidelity of the original models is preserved. Further experiments with an adversarial attack show that the unlearned model is more robust in the presence of malicious parties.
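The abstract does not state how the latent representation of the target feature is identified, nor how fine-tuning uses it. As a rough illustration only, the sketch below assumes one simple possibility: the representation is taken to be the normalized difference between the mean latent code of generated samples that contain the feature and the mean latent code of samples that do not. The helper names (`mean_vector`, `target_direction`, `remove_feature`), the toy Gaussian latents, and the projection step are all hypothetical and are not taken from the paper; the projection merely stands in for the effect that fine-tuning would aim to achieve.

```python
import random


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def target_direction(with_feature, without_feature):
    """Unit vector pointing from the 'without' mean to the 'with' mean.

    Assumption: a mean-difference direction is an adequate proxy for the
    latent representation of the target feature.
    """
    mu_pos = mean_vector(with_feature)
    mu_neg = mean_vector(without_feature)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sum(d * d for d in diff) ** 0.5
    if norm == 0.0:
        raise ValueError("the two groups have identical means")
    return [d / norm for d in diff]


def remove_feature(z, direction):
    """Project z onto the subspace orthogonal to the unit vector `direction`."""
    coeff = sum(a * b for a, b in zip(z, direction))
    return [a - coeff * b for a, b in zip(z, direction)]


# Toy latents: the hypothetical target feature shifts coordinate 0 by 3.0.
rng = random.Random(0)
dim = 8
without_feature = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
with_feature = [
    [rng.gauss(0, 1) + (3.0 if i == 0 else 0.0) for i in range(dim)]
    for _ in range(200)
]

direction = target_direction(with_feature, without_feature)
z_clean = remove_feature(with_feature[0], direction)

# Component of the cleaned latent along the target direction (zero up to
# floating-point error, by construction of the projection).
residual = sum(a * b for a, b in zip(z_clean, direction))
```

In a real pipeline the latents would come from the generator's prior or an encoder, and the labels "with" and "without" would come from inspecting the generated images; both steps are outside this sketch.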