Recent advances in diffusion models have shown great promise for producing high-quality video content. However, efficiently training diffusion models that integrate directional guidance with controllable motion intensity remains a challenging and under-explored problem. This paper introduces Mojito, a diffusion model that incorporates both \textbf{Mo}tion tra\textbf{j}ectory and \textbf{i}ntensi\textbf{t}y contr\textbf{o}l for text-to-video generation. Specifically, Mojito features a Directional Motion Control module that leverages cross-attention to steer the generated object's motion without additional training, alongside a Motion Intensity Modulator that uses optical flow maps extracted from videos to guide varying levels of motion intensity. Extensive experiments demonstrate that Mojito achieves precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match the specified directions and intensities and exhibit realistic dynamics consistent with natural motion in real-world scenarios.
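To make the intensity-conditioning idea concrete, the Motion Intensity Modulator derives its guidance signal from optical flow maps. A minimal sketch of one plausible scalar summary of such a flow field is its mean per-pixel displacement magnitude; the function name and this particular reduction are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def motion_intensity(flow: np.ndarray) -> float:
    """Summarize an optical flow field as a scalar motion intensity.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements between
    two consecutive frames. Hypothetical helper for illustration only.
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # per-pixel speed
    return float(magnitude.mean())             # average speed over the frame

# Example: uniform rightward motion of 3 pixels per frame.
flow = np.zeros((4, 4, 2))
flow[..., 0] = 3.0
print(motion_intensity(flow))  # 3.0
```

A scalar like this (or a binned version of it) could then serve as a conditioning input alongside the text prompt, letting the user request low- or high-intensity motion at inference time.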