Most existing video diffusion models (VDMs) condition only on text. As a result, they offer little control over the visual appearance and geometric structure of the generated videos. This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text. The model is built upon a core module, called the multimodal video block (MVB), which consists of conventional spatial-temporal layers for representing video features and a decoupled cross-attention layer that attends to image and text inputs for appearance conditioning. In addition, we carefully design the model architecture so that it can optionally integrate with pre-trained image ControlNet modules for geometric visual conditions, without the extra training overhead that prior methods require. Experiments show that, with its versatile multimodal conditioning mechanisms, Moonshot achieves significant improvements in visual quality and temporal consistency compared to existing models. Moreover, the model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation, and video editing, which suggests its potential to serve as a fundamental architecture for controllable video generation. Models will be made public on https://github.com/salesforce/LAVIS.
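To make the idea of a decoupled cross-attention layer concrete, the following is a minimal sketch and not the Moonshot implementation. It assumes that the video tokens act as shared queries, that text and image tokens are attended to in two separate branches, and that the two results are summed with a weight on the image branch. The function names, the `image_scale` parameter, and the summation rule are illustrative assumptions; the abstract does not specify them. Learned query, key, and value projections are omitted for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on plain Python lists (no learned projections)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def decoupled_cross_attention(video_tokens, text_tokens, image_tokens, image_scale=1.0):
    """Hypothetical decoupled cross-attention.

    The same video queries attend to text tokens and to image tokens in two
    independent branches. Summing the branches is an assumption made for this
    sketch, as is the `image_scale` weight.
    """
    text_out = attend(video_tokens, text_tokens, text_tokens)
    image_out = attend(video_tokens, image_tokens, image_tokens)
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]
```

In this sketch, setting `image_scale=0.0` reduces the layer to text-only cross-attention, which illustrates why keeping the two branches separate lets image conditioning be added to a text-conditioned model without altering the text pathway.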