Customizing text-to-image (T2I) models has seen tremendous progress recently, particularly in areas such as personalization, stylization, and conditional generation. However, expanding this progress to video generation is still in its infancy, primarily due to the lack of customized video data. In this work, we introduce Still-Moving, a novel, generic framework for customizing a text-to-video (T2V) model without requiring any customized video data. The framework applies to the prominent T2V design in which the video model is built over a text-to-image (T2I) model (e.g., via inflation). We assume access to a customized version of the T2I model, trained only on still image data (e.g., using DreamBooth or StyleDrop). Naively plugging the weights of the customized T2I model into the T2V model often leads to significant artifacts or insufficient adherence to the customization data. To overcome this issue, we train lightweight $\textit{Spatial Adapters}$ that adjust the features produced by the injected T2I layers. Importantly, our adapters are trained on $\textit{"frozen videos"}$ (i.e., repeated images), constructed from image samples generated by the customized T2I model. This training is facilitated by a novel $\textit{Motion Adapter}$ module, which allows us to train on such static videos while preserving the motion prior of the video model. At test time, we remove the Motion Adapter modules and retain only the trained Spatial Adapters. This restores the motion prior of the T2V model while adhering to the spatial prior of the customized T2I model. We demonstrate the effectiveness of our approach on diverse tasks including personalized, stylized, and conditional generation. In all evaluated scenarios, our method seamlessly integrates the spatial prior of the customized T2I model with the motion prior supplied by the T2V model.
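The two core ideas above can be illustrated with a minimal sketch: (1) "frozen videos" are formed by repeating a still image along the temporal axis, and (2) the Motion Adapters are active only during training and are dropped at test time, leaving the Spatial Adapters in place. All names below (`make_frozen_video`, `AdapterConfig`, etc.) are hypothetical illustrations, not the authors' actual implementation.

```python
def make_frozen_video(image, num_frames):
    """Build a static "frozen video" by repeating one still image.

    `image` stands in for a sample generated by the customized T2I
    model; the result is a motionless clip used as training data.
    """
    return [image for _ in range(num_frames)]


class AdapterConfig:
    """Toy train-time vs. test-time adapter configuration.

    During training, both Spatial and Motion Adapters are active so
    the model can fit static clips without unlearning its motion
    prior. At test time the Motion Adapters are removed, restoring
    the T2V motion prior while keeping the trained Spatial Adapters.
    """

    def __init__(self):
        self.spatial_adapters = True   # trained on frozen videos
        self.motion_adapters = True    # train-time only

    def to_test_mode(self):
        # Remove Motion Adapters; keep the trained Spatial Adapters.
        self.motion_adapters = False
        return self
```

This sketch only captures the control flow described in the abstract; the actual adapters are lightweight learned layers inserted into the inflated T2V architecture.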