Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel, generic framework for customizing a text-to-video (T2V) model without requiring any customized video data. The framework applies to the prominent T2V design in which the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight $\textit{Spatial Adapters}$ that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on $\textit{"frozen videos"}$ (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel $\textit{Motion Adapter}$ module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and retain only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with the motion prior supplied by the T2V model.
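The two core ideas above can be illustrated with a minimal sketch: (1) "frozen videos" are formed by repeating a still image along the temporal axis, and (2) the Motion Adapters are active only during training and are dropped at test time, leaving the Spatial Adapters in place. All names below (`make_frozen_video`, `AdapterConfig`, etc.) are hypothetical illustrations, not the authors' actual implementation.

```python
def make_frozen_video(image, num_frames):
    """Build a static "frozen video" by repeating one still image.

    `image` stands in for a sample generated by the customized T2I
    model; the result is a motionless clip used as training data.
    """
    return [image for _ in range(num_frames)]


class AdapterConfig:
    """Toy train-time vs. test-time adapter configuration.

    During training, both Spatial and Motion Adapters are active so
    the model can fit static clips without unlearning its motion
    prior. At test time the Motion Adapters are removed, restoring
    the T2V motion prior while keeping the trained Spatial Adapters.
    """

    def __init__(self):
        self.spatial_adapters = True   # trained on frozen videos
        self.motion_adapters = True    # train-time only

    def to_test_mode(self):
        # Remove Motion Adapters; keep the trained Spatial Adapters.
        self.motion_adapters = False
        return self
```

This sketch only captures the control flow described in the abstract; the actual adapters are lightweight learned layers inserted into the inflated T2V architecture.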