Video generation models (VGMs) have demonstrated the capability to synthesize high-quality videos. It is therefore important to understand their potential to produce unsafe content, such as violent or terrifying videos. In this work, we provide a comprehensive understanding of unsafe video generation. First, to confirm that these models can indeed generate unsafe videos, we collect unsafe-content generation prompts from 4chan and Lexica and use three open-source SOTA VGMs to generate videos from them. After filtering out duplicates and poorly generated content, we obtain an initial set of 2112 potentially unsafe videos from an original pool of 5607. Through clustering and thematic coding analysis of these videos, we identify five unsafe video categories: Distorted/Weird, Terrifying, Pornographic, Violent/Bloody, and Political. With IRB approval, we then recruit online participants to label the generated videos. Based on annotations submitted by 403 participants, we identify 937 unsafe videos in the initial set. Using this labeled information and the corresponding prompts, we create the first dataset of unsafe videos generated by VGMs. We then study possible defense mechanisms for preventing the generation of unsafe videos. Existing defenses for image generation focus on filtering either the input prompts or the output results. In contrast, we propose Latent Variable Defense (LVD), which works within the model's internal sampling process. LVD achieves a defense accuracy of 0.90 while reducing time and computing resources by 10x when sampling a large number of unsafe prompts.
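The abstract only names LVD; as a rough illustration of the general idea it describes (checking intermediate latent variables during the sampling loop rather than filtering prompts or finished videos), the following is a minimal, hypothetical Python sketch. Everything in it, including the probe architecture, the placeholder `denoise_step`, and the check interval, is an assumption made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a latent-variable defense: monitor intermediate
# latents during diffusion sampling and abort early when a lightweight
# classifier flags them as likely to decode into unsafe content. Early
# termination is what would save time and compute relative to generating
# the full video and filtering the finished output.
import torch
import torch.nn as nn


class UnsafeLatentProbe(nn.Module):
    """Tiny illustrative classifier scoring how 'unsafe' a latent looks."""

    def __init__(self, latent_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.head(z)  # probability in [0, 1]


def denoise_step(z: torch.Tensor, t: int) -> torch.Tensor:
    """Placeholder for one denoising step of a real diffusion-based VGM."""
    return z - 0.01 * torch.randn_like(z)


def sample_with_latent_defense(
    z: torch.Tensor,
    probe: UnsafeLatentProbe,
    num_steps: int = 50,
    check_every: int = 10,
    threshold: float = 0.5,
):
    """Run the sampling loop, aborting early if the probe flags the latent."""
    for t in range(num_steps):
        z = denoise_step(z, t)
        if t % check_every == 0:
            p_unsafe = probe(z).item()
            if p_unsafe > threshold:
                return None, t  # abort: refuse to finish generation
    return z, num_steps  # latent deemed safe; decode to video downstream


# Usage with a toy latent (batch of one, 4 frames x 4 channels x 8 x 8).
latent = torch.randn(1, 4 * 4 * 8 * 8)
probe = UnsafeLatentProbe(latent_dim=4 * 4 * 8 * 8)
video_latent, steps_used = sample_with_latent_defense(latent, probe)
print("aborted early" if video_latent is None else f"sampled in {steps_used} steps")
```

In this reading, the 10x resource reduction comes from checking latents every few denoising steps and refusing to complete sampling for flagged prompts, rather than decoding every video and classifying the output.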