Text-to-video (T2V) generation, i.e., generating videos from a given text prompt, has advanced significantly in recent years. However, relying solely on text prompts often results in ambiguous frame composition due to spatial uncertainty. The research community has therefore leveraged dense structure signals, e.g., per-frame depth or edge sequences, to enhance controllability, but collecting these signals increases the burden at inference time. In this work, we present SparseCtrl, which enables flexible structure control with temporally sparse signals, requiring only one or a few inputs, as shown in Figure 1. It incorporates an additional condition encoder to process these sparse signals while leaving the pre-trained T2V model untouched. The proposed approach is compatible with various modalities, including sketches, depth maps, and RGB images, providing more practical control for video generation and supporting applications such as storyboarding, depth rendering, keyframe animation, and interpolation. Extensive experiments demonstrate that SparseCtrl generalizes to both original and personalized T2V generators. Code and models will be publicly available at https://guoyww.github.io/projects/SparseCtrl.
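The abstract does not say how temporally sparse signals are presented to the condition encoder. As a minimal sketch of one plausible layout, assumed here purely for illustration, the few available condition frames can be placed in a zero-filled sequence together with a binary mask channel that marks which frames actually carry a signal. The function names `build_sparse_condition` and `encoder_input`, the tensor layout, and the use of NumPy are all hypothetical and are not taken from the SparseCtrl code.

```python
import numpy as np


def build_sparse_condition(num_frames, keyframes, height, width, channels):
    """Pack a few per-frame condition images into a dense, masked sequence.

    keyframes maps a frame index to an array of shape (height, width, channels).
    Returns (conditions, mask) with shapes (T, H, W, C) and (T, H, W, 1).
    This layout is an assumption for illustration, not the paper's definition.
    """
    conditions = np.zeros((num_frames, height, width, channels), dtype=np.float32)
    mask = np.zeros((num_frames, height, width, 1), dtype=np.float32)
    for index, image in keyframes.items():
        if not 0 <= index < num_frames:
            raise IndexError(f"frame index {index} outside [0, {num_frames})")
        image = np.asarray(image, dtype=np.float32)
        if image.shape != (height, width, channels):
            raise ValueError(f"expected shape {(height, width, channels)}, got {image.shape}")
        conditions[index] = image  # conditioned frame keeps its signal
        mask[index] = 1.0          # unconditioned frames stay all-zero
    return conditions, mask


def encoder_input(conditions, mask):
    """Concatenate the mask as an extra channel: (T, H, W, C + 1)."""
    return np.concatenate([conditions, mask], axis=-1)


# Example: a 16-frame clip controlled by depth maps on the first and last frame only.
depth_first = np.full((8, 8, 1), 0.25, dtype=np.float32)
depth_last = np.full((8, 8, 1), 0.75, dtype=np.float32)
conditions, mask = build_sparse_condition(16, {0: depth_first, 15: depth_last}, 8, 8, 1)
packed = encoder_input(conditions, mask)
```

The mask channel lets a downstream encoder tell a frame with no signal apart from a frame whose signal happens to be all zeros, such as a black sketch canvas. Whether the actual model makes the same choice should be checked against the released code.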