We tackle the problem of feature unlearning from a pretrained image generative model. Unlike the common unlearning task, in which the unlearning target is a subset of the training set, we aim to unlearn a specific feature, such as hairstyle in facial images, from a pretrained generative model. Because the target feature is present only in a local region of an image, unlearning the entire image from the pretrained model may lose other details in the remaining regions of the image. To specify which feature to unlearn, we develop an implicit feedback mechanism in which a user selects images containing the target feature. From this implicit feedback, we identify a latent representation corresponding to the target feature and then use that representation to unlearn the feature from the generative model. Our framework applies to two well-known families of generative models: GANs and VAEs. Through experiments on the MNIST and CelebA datasets, we show that target features are successfully removed while the fidelity of the original models is preserved.
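The abstract does not say how the latent representation of the target feature is obtained from the user's selections. As a minimal sketch only, one common heuristic is to take the difference between the mean latent code of the selected images and the mean latent code of the unselected ones, and then project that direction out of a latent code. This is an assumption for illustration, not necessarily the authors' method; the function names `feature_direction` and `remove_feature` are hypothetical, and the latent codes are assumed to be already available as a NumPy array (for example, from a VAE encoder or from sampled GAN inputs). Actual unlearning would additionally update the generator, which is not shown here.

```python
import numpy as np


def feature_direction(latents, selected):
    """Estimate a unit latent direction for the target feature.

    Assumption (not from the paper): the direction is the difference between
    the mean latent of user-selected samples and the mean of the rest.
    """
    latents = np.asarray(latents, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    if selected.all() or not selected.any():
        raise ValueError("need at least one selected and one unselected sample")
    diff = latents[selected].mean(axis=0) - latents[~selected].mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("selected and unselected means coincide")
    return diff / norm


def remove_feature(latents, direction):
    """Project latent codes onto the subspace orthogonal to `direction`."""
    latents = np.asarray(latents, dtype=float)
    # Component of each latent code along the (unit) feature direction.
    coeff = np.asarray(latents @ direction)
    return latents - coeff[..., None] * direction


if __name__ == "__main__":
    # Synthetic latents: the "feature" shifts coordinate 0 for selected samples.
    rng = np.random.default_rng(0)
    z = rng.normal(size=(200, 8))
    picked = np.arange(200) < 100
    z[picked, 0] += 5.0

    d = feature_direction(z, picked)
    z_clean = remove_feature(z, d)
    print("largest remaining component along d:", np.abs(z_clean @ d).max())
```

Projection is used here because it leaves every component orthogonal to the estimated direction untouched, which mirrors the stated goal of removing one feature while preserving the rest of the image.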