Existing text-to-video (T2V) models often struggle to generate videos with sufficiently pronounced or complex actions. A key limitation lies in the text prompt's inability to precisely convey intricate motion details. To address this, we propose a novel framework, MVideo, designed to produce long-duration videos with precise, fluid actions. MVideo overcomes the limitations of text prompts by incorporating mask sequences as an additional motion-condition input, providing a clearer, more accurate representation of intended actions. Leveraging foundational vision models such as GroundingDINO and SAM2, MVideo generates mask sequences automatically, enhancing both efficiency and robustness. Our results demonstrate that, after training, MVideo effectively aligns text prompts with motion conditions to produce videos that satisfy both simultaneously. This dual-control mechanism enables more dynamic video generation by allowing the text prompt and the motion condition to be altered independently or in tandem. Furthermore, MVideo supports motion-condition editing and composition, facilitating the generation of videos with more complex actions. MVideo thus advances T2V motion generation, setting a strong benchmark for improved action depiction in current video diffusion models. Our project page is available at https://mvideo-v1.github.io/.