Video generation and editing conditioned on text prompts or images have advanced significantly. However, it remains challenging to accurately control global layout and geometric details with text alone, or to support motion control and local modification through images. In this paper, we aim to achieve sketch-based spatial and motion control for video generation and to support fine-grained editing of real or synthetic videos. Building on a DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features for skipped DiT blocks. Sketches are drawn on one or two keyframes (at arbitrary time points) for easy interaction. To propagate such temporally sparse sketch conditions across all frames, we propose an inter-frame attention mechanism that analyzes the relationship between the keyframes and each video frame. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial features and dynamic motion. During inference, we employ latent fusion to accurately preserve unedited regions. Extensive experiments demonstrate that our SketchVideo achieves superior performance in controllable video generation and editing.
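To make the memory-efficient control structure concrete, the following is a minimal sketch, not the authors' code, of the stated idea: a lightweight control block takes the hidden state entering a skipped DiT block together with encoded sketch features and predicts a residual feature that stands in for that block's output. All class, parameter, and shape choices here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SketchControlBlock(nn.Module):
    """Hypothetical control block: predicts a residual feature for one skipped DiT block."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        # Zero-initialize the final layer so the control branch starts as an
        # identity mapping, a common stabilization trick (as in ControlNet);
        # whether SketchVideo does this is an assumption.
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, hidden: torch.Tensor, sketch_feat: torch.Tensor) -> torch.Tensor:
        # hidden, sketch_feat: (batch, tokens, dim)
        residual = self.mlp(torch.cat([self.norm(hidden), sketch_feat], dim=-1))
        # The predicted residual replaces the skipped block's computation,
        # so only the small control block needs gradients during training.
        return hidden + residual
```

Because the control blocks substitute for skipped DiT blocks rather than duplicating the full backbone, only these small modules are trained, which is where the memory saving would come from.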
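The inter-frame attention can likewise be illustrated with a hedged sketch: each video frame's tokens query the tokens of the one or two sketched keyframes, so the temporally sparse condition is propagated to every frame. The module name, shapes, and the residual connection are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InterFrameAttention(nn.Module):
    """Hypothetical cross-attention from each frame to the sketched keyframes."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor, keyframe_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens:    (batch * num_frames, tokens, dim), one row per frame
        # keyframe_tokens: (batch * num_frames, key_tokens, dim), the keyframe
        #                  sketch features broadcast to every frame beforehand
        out, _ = self.attn(query=frame_tokens, key=keyframe_tokens, value=keyframe_tokens)
        return frame_tokens + out
```

In the same spirit, the latent fusion used at inference could be realized as a per-step mask blend, e.g. z = m * z_edit + (1 - m) * z_orig with an edit-region mask m, though the paper's exact fusion schedule is not specified here.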