In recent years, diffusion models have made remarkable strides in text-to-video generation, sparking a quest for enhanced control over video outputs so that they more accurately reflect user intentions. Traditional efforts predominantly employ either semantic cues, such as images or depth maps, or motion-based conditions, such as moving sketches or object bounding boxes. Semantic inputs offer rich scene context but lack detailed motion specificity; conversely, motion inputs provide precise trajectory information but miss the broader semantic narrative. For the first time, we integrate both semantic and motion cues within a diffusion model for video generation, as illustrated in Fig. 1. To this end, we introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. It incorporates an established motion conditioning module and investigates various approaches to integrating scene conditions, promoting synergy between the two modalities. For model training, we decouple the conditions of the two modalities and introduce a two-stage training pipeline. Experimental results demonstrate that our design significantly enhances video quality, motion precision, and semantic coherence.
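To make the dual-conditioning and two-stage training ideas concrete, the following is a minimal, hypothetical sketch in PyTorch. It is not the authors' implementation: the module names (DualConditionDenoiser, motion_encoder, scene_encoder, set_stage), the feature dimensions, and the exact freezing schedule are all assumptions chosen only to illustrate a denoiser that consumes a scene condition (a reference-image feature) and a motion condition (per-frame bounding boxes), with parameter groups that can be trained in separate stages.

# Hypothetical sketch; all names, shapes, and the stage schedule are assumptions.
import torch
import torch.nn as nn

class DualConditionDenoiser(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=64, scene_feat_dim=512):
        super().__init__()
        # stand-in for the video diffusion UNet backbone
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim + 2 * cond_dim, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )
        # motion branch: encodes per-frame boxes (x, y, w, h)
        self.motion_encoder = nn.Linear(4, cond_dim)
        # scene branch: encodes a pooled reference-image feature
        self.scene_encoder = nn.Linear(scene_feat_dim, cond_dim)

    def forward(self, noisy_latent, boxes, scene_feat):
        m = self.motion_encoder(boxes)                # (B, T, cond_dim)
        s = self.scene_encoder(scene_feat)            # (B, cond_dim)
        s = s.unsqueeze(1).expand(-1, m.size(1), -1)  # broadcast over frames
        x = torch.cat([noisy_latent, m, s], dim=-1)
        return self.backbone(x)                       # predicted noise per frame

def set_stage(model, stage):
    # assumed two-stage schedule: stage 1 trains only the motion branch,
    # stage 2 additionally unfreezes the scene branch
    for p in model.motion_encoder.parameters():
        p.requires_grad_(True)
    for p in model.scene_encoder.parameters():
        p.requires_grad_(stage >= 2)

# toy usage on random tensors
model = DualConditionDenoiser()
set_stage(model, stage=1)
B, T = 2, 8
noisy = torch.randn(B, T, 64)
boxes = torch.rand(B, T, 4)
scene = torch.randn(B, 512)
pred = model(noisy, boxes, scene)
print(pred.shape)  # torch.Size([2, 8, 64])

In this reading, separating the optimizable parameters per modality is what allows the motion condition to be learned first and the scene condition to be integrated afterward; the actual SMCD architecture and schedule are described in the paper itself.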