With advances in computer vision and natural language processing, text-to-video generation, enabled by text-to-video diffusion models, has become increasingly prevalent. These models are trained on large amounts of data from the internet. However, the training data often contain copyrighted content, including cartoon character icons and artist styles, private portraits, and unsafe videos. Since filtering the data and retraining the model is challenging, methods for unlearning specific concepts from text-to-video diffusion models have been investigated. However, due to the high computational complexity and relatively large optimization scale, there is little work on unlearning methods for text-to-video diffusion models. We propose a novel concept-unlearning method that transfers the unlearning capability of the text encoder of text-to-image diffusion models to text-to-video diffusion models. Specifically, the method optimizes the text encoder with few-shot unlearning, using only a handful of generated images. We then plug the optimized text encoder into a text-to-video diffusion model to generate videos. Our method requires little computation and has a small optimization scale. We analyze the videos generated after unlearning a concept. Experiments demonstrate that our method can unlearn copyrighted cartoon characters, artist styles, objects, and people's facial characteristics, and it can unlearn a concept within about 100 seconds on an RTX 3070. As no concept-unlearning method previously existed for text-to-video diffusion models, our work makes concept unlearning feasible and more accessible in the text-to-video domain.
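The transfer idea above — optimizing only the text encoder so that a target concept's embedding is redirected, then reusing that encoder in a video pipeline — can be illustrated with a minimal toy sketch. Everything here is an illustrative assumption, not the paper's implementation: the "text encoder" is a single linear map, the unlearning loss pulls the concept prompt's embedding toward a neutral anchor prompt, and a retention term keeps an unrelated prompt's embedding fixed.

```python
import numpy as np

# Hypothetical minimal sketch of few-shot text-encoder unlearning.
# W is a toy "text encoder"; only W is optimized, as in the method above.
rng = np.random.default_rng(0)
dim_in, dim_out = 8, 4
W = rng.normal(size=(dim_out, dim_in))      # trainable encoder weights
W0 = W.copy()                               # frozen original encoder

def unit(v):
    return v / np.linalg.norm(v)

target = unit(rng.normal(size=dim_in))      # prompt containing the concept to unlearn
anchor = unit(rng.normal(size=dim_in))      # neutral replacement prompt
retain = unit(rng.normal(size=dim_in))      # unrelated prompt to preserve

anchor_emb = W0 @ anchor                    # fixed embedding the concept is mapped to
retain_emb0 = W0 @ retain                   # original embedding to keep unchanged

lr, lam = 0.05, 1.0
for _ in range(200):
    # unlearning term: push the concept prompt's embedding onto the anchor's
    e = W @ target - anchor_emb
    grad = 2.0 * np.outer(e, target)
    # retention term: keep the unrelated prompt's embedding where it was
    r = W @ retain - retain_emb0
    grad += lam * 2.0 * np.outer(r, retain)
    W -= lr * grad

# After optimization, the concept prompt now encodes like the neutral anchor,
# while the retained prompt's embedding is essentially unchanged; the updated
# encoder could then be dropped into a downstream (here hypothetical) video model.
```

In the real setting the linear map would be a pretrained text encoder (e.g. CLIP's), the anchor/retain embeddings would come from generated images and prompts, and the optimized encoder would simply replace the original one inside the text-to-video pipeline, which is what keeps the optimization scale small.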