With advances in computer vision and natural language processing, text-to-video generation, enabled by text-to-video diffusion models, has become increasingly prevalent. These models are trained on large amounts of data from the internet. However, the training data often contain copyrighted content, including cartoon character icons and artist styles, private portraits, and unsafe videos. Since filtering the data and retraining the model is challenging, methods for unlearning specific concepts from text-to-video diffusion models have been investigated. However, due to the high computational complexity and relatively large optimization scale, there is little work on unlearning methods for text-to-video diffusion models. We propose a novel concept-unlearning method that transfers the unlearning capability of the text encoder of text-to-image diffusion models to text-to-video diffusion models. Specifically, the method optimizes the text encoder with few-shot unlearning, using only a handful of generated images. We then plug the optimized text encoder into a text-to-video diffusion model to generate videos. Our method requires little computation and has a small optimization scale. We analyze the videos generated after unlearning a concept. Experiments demonstrate that our method can unlearn copyrighted cartoon characters, artist styles, objects, and people's facial characteristics, and it can unlearn a concept within about 100 seconds on an RTX 3070. As no concept-unlearning method previously existed for text-to-video diffusion models, our work makes concept unlearning feasible and more accessible in the text-to-video domain.
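The transfer idea above — optimizing only the text encoder so that a target concept's embedding is redirected, then reusing that encoder in a video pipeline — can be illustrated with a minimal toy sketch. Everything here is an illustrative assumption, not the paper's implementation: the "text encoder" is a single linear map, the unlearning loss pulls the concept prompt's embedding toward a neutral anchor prompt, and a retention term keeps an unrelated prompt's embedding fixed.

```python
import numpy as np

# Hypothetical minimal sketch of few-shot text-encoder unlearning.
# W is a toy "text encoder"; only W is optimized, as in the method above.
rng = np.random.default_rng(0)
dim_in, dim_out = 8, 4
W = rng.normal(size=(dim_out, dim_in))      # trainable encoder weights
W0 = W.copy()                               # frozen original encoder

def unit(v):
    return v / np.linalg.norm(v)

target = unit(rng.normal(size=dim_in))      # prompt containing the concept to unlearn
anchor = unit(rng.normal(size=dim_in))      # neutral replacement prompt
retain = unit(rng.normal(size=dim_in))      # unrelated prompt to preserve

anchor_emb = W0 @ anchor                    # fixed embedding the concept is mapped to
retain_emb0 = W0 @ retain                   # original embedding to keep unchanged

lr, lam = 0.05, 1.0
for _ in range(200):
    # unlearning term: push the concept prompt's embedding onto the anchor's
    e = W @ target - anchor_emb
    grad = 2.0 * np.outer(e, target)
    # retention term: keep the unrelated prompt's embedding where it was
    r = W @ retain - retain_emb0
    grad += lam * 2.0 * np.outer(r, retain)
    W -= lr * grad

# After optimization, the concept prompt now encodes like the neutral anchor,
# while the retained prompt's embedding is essentially unchanged; the updated
# encoder could then be dropped into a downstream (here hypothetical) video model.
```

In the real setting the linear map would be a pretrained text encoder (e.g. CLIP's), the anchor/retain embeddings would come from generated images and prompts, and the optimized encoder would simply replace the original one inside the text-to-video pipeline, which is what keeps the optimization scale small.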