We tackle the problem of feature unlearning from pre-trained image generative models, specifically GANs and VAEs. Unlike the common unlearning task, where the unlearning target is a subset of the training set, we aim to unlearn a specific feature, such as hairstyle in facial images, from the pre-trained generative model. Because the target feature is present only in a local region of an image, unlearning the entire image from the pre-trained model may discard details in the remaining regions. To specify which feature to unlearn, we collect randomly generated images that contain the target feature. We then identify a latent representation corresponding to the target feature and use this representation to fine-tune the pre-trained model. Through experiments on the MNIST and CelebA datasets, we show that target features are successfully removed while the fidelity of the original models is preserved. Further experiments with an adversarial attack show that the unlearned model is more robust in the presence of malicious parties.
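The abstract does not spell out how the latent representation of the target feature is identified, so the following is only a minimal sketch under stated assumptions, not the paper's procedure. It assumes that each generated sample has a latent code and a binary label saying whether the target feature is present, and it estimates a feature direction as the difference of the two mean latent codes. The helper names `target_direction` and `remove_component` are hypothetical, and the orthogonal projection stands in for whatever objective the fine-tuning step actually uses.

```python
import numpy as np


def target_direction(latents, has_feature):
    """Estimate a feature direction in latent space (assumed approach).

    Returns the mean latent code of samples showing the feature minus
    the mean latent code of samples that do not show it.
    """
    latents = np.asarray(latents, dtype=float)
    has_feature = np.asarray(has_feature, dtype=bool)
    if not has_feature.any() or has_feature.all():
        raise ValueError("need samples both with and without the feature")
    return latents[has_feature].mean(axis=0) - latents[~has_feature].mean(axis=0)


def remove_component(z, direction):
    """Project each row of z onto the subspace orthogonal to direction."""
    unit = direction / np.linalg.norm(direction)
    return z - np.outer(z @ unit, unit)


# Synthetic stand-in for latent codes of randomly generated images.
rng = np.random.default_rng(0)
latents = rng.normal(size=(200, 8))
# Hypothetical labels: pretend the feature appears when coordinate 0 is large.
has_feature = latents[:, 0] > 0.5

direction = target_direction(latents, has_feature)
cleaned = remove_component(latents, direction)
```

In a real pipeline the labels would come from a human annotator or a feature classifier applied to the generated images, and latent codes such as `cleaned` could serve as feature-free references during fine-tuning. Both of those choices are assumptions here rather than details stated in the abstract.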