Despite advances in Text-to-Video (T2V) generation, producing videos with realistic motion remains challenging. Current models often yield static or minimally dynamic outputs, failing to capture the complex motions described by text. This issue stems from internal biases in text encoding, which overlook motion, and from inadequate conditioning mechanisms in T2V generation models. To address this, we propose a novel framework called DEcomposed MOtion (DEMO), which enhances motion synthesis in T2V generation by decomposing both text encoding and conditioning into content and motion components. Our method includes a content encoder for static elements and a motion encoder for temporal dynamics, alongside separate content and motion conditioning mechanisms. Crucially, we introduce text-motion and video-motion supervision to improve the model's understanding and generation of motion. Evaluations on the MSR-VTT, UCF-101, WebVid-10M, EvalCrafter, and VBench benchmarks demonstrate DEMO's superior ability to produce videos with enhanced motion dynamics while maintaining high visual quality. Our approach significantly advances T2V generation by integrating comprehensive motion understanding directly from textual descriptions. Project page: https://PR-Ryan.github.io/DEMO-project/
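To make the decomposition concrete, the following minimal PyTorch sketch gives one plausible reading of the abstract: the prompt is encoded twice, once for static content and once for temporal dynamics, and each stream conditions the video backbone through its own cross-attention path. All class names, dimensions, and layer counts here (`DecomposedTextEncoder`, `DualConditioningBlock`, `dim=512`, and so on) are illustrative assumptions; the abstract specifies only the content/motion split and separate conditioning, not the concrete architecture or the text-motion and video-motion supervision losses.

```python
import torch
import torch.nn as nn


class DecomposedTextEncoder(nn.Module):
    """Encode a prompt into separate content and motion embeddings.

    Hypothetical sketch: DEMO's actual encoders, dimensions, and layer
    counts are not given in the abstract; only the content/motion split is.
    """

    def __init__(self, vocab_size=49408, dim=512, n_heads=8, n_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        make_layer = lambda: nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Content encoder: static elements (objects, scene, appearance).
        self.content_encoder = nn.TransformerEncoder(make_layer(), num_layers=n_layers)
        # Motion encoder: temporal dynamics described by the prompt.
        self.motion_encoder = nn.TransformerEncoder(make_layer(), num_layers=n_layers)

    def forward(self, token_ids):
        x = self.token_emb(token_ids)
        return self.content_encoder(x), self.motion_encoder(x)


class DualConditioningBlock(nn.Module):
    """One backbone block with separate content and motion cross-attention.

    Assumed wiring: content embeddings condition the features first, then
    motion embeddings; the paper's exact mechanism may differ.
    """

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.content_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.motion_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, video_tokens, content, motion):
        h, _ = self.content_attn(video_tokens, content, content)
        h, _ = self.motion_attn(h, motion, motion)
        return h


if __name__ == "__main__":
    enc = DecomposedTextEncoder()
    block = DualConditioningBlock()
    ids = torch.randint(0, 49408, (2, 77))   # batch of 2 prompts, 77 tokens each
    content, motion = enc(ids)                # (2, 77, 512) each
    video = torch.randn(2, 16 * 64, 512)      # flattened video latent tokens
    out = block(video, content, motion)       # (2, 1024, 512)
    print(out.shape)
```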