Text-to-video generation has progressed rapidly in recent years, and state-of-the-art models can now generate high-quality, realistic videos. However, these models do not let users interactively control the videos they generate, a capability that could unlock new areas of application. As a first step towards this goal, we tackle the problem of endowing diffusion-based video generation models with interactive spatio-temporal control over their output. To this end, we take inspiration from recent advances in the segmentation literature and propose Peekaboo, a novel spatio-temporal masked attention module. Peekaboo is a training-free addition to off-the-shelf video generation models that enables spatio-temporal control without adding inference overhead. We also propose an evaluation benchmark for the interactive video generation task. Through extensive qualitative and quantitative evaluation, we establish that Peekaboo enables controlled video generation and obtains a gain of up to 3.8x in mIoU over baseline models.
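As a rough illustration of what masked attention under a spatio-temporal mask can look like, the following is a minimal NumPy sketch. It is not the Peekaboo implementation: the per-frame box format `(x0, y0, x1, y1)`, the helper names `region_mask` and `masked_attention`, and the rule that foreground tokens attend only to foreground tokens (and background only to background) are all assumptions made for this example.

```python
import numpy as np


def masked_attention(q, k, v, allow):
    """Scaled dot-product attention restricted by a boolean mask.

    q: (n_q, d), k: (n_k, d), v: (n_k, d_v)
    allow: (n_q, n_k) boolean, True where query i may attend to key j.
    Every query is assumed to have at least one allowed key.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get a large negative score, so their softmax weight is ~0.
    scores = np.where(allow, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


def region_mask(num_frames, height, width, boxes):
    """Flattened boolean foreground mask built from one box per frame.

    boxes: list of (x0, y0, x1, y1) in latent-grid coordinates
    (a hypothetical format chosen for this sketch).
    """
    fg = np.zeros((num_frames, height, width), dtype=bool)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        fg[t, y0:y1, x0:x1] = True
    return fg.reshape(-1)


# Toy setup: 2 frames of a 4x4 latent grid; the box moves to the right.
fg = region_mask(2, 4, 4, [(0, 0, 2, 2), (2, 0, 4, 2)])

# Assumed masking rule: tokens attend only within their own region.
allow = fg[:, None] == fg[None, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(fg.size, 8))  # one 8-dim feature per space-time token
out, weights = masked_attention(x, x, x, allow)
```

Because the mask only changes which scores enter the softmax, a scheme of this kind needs no extra parameters or training, which is in the spirit of the training-free property claimed above.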