Text-to-video (T2V) models have shown remarkable capabilities in generating diverse videos. However, they struggle to produce user-desired stylized videos because (i) text is inherently clumsy at expressing specific styles and (ii) their style fidelity is generally degraded. To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style from a reference image. Considering the scarcity of stylized video datasets, we propose to first train the style control adapter on style-rich image datasets, then transfer the learned stylization ability to video generation through a tailor-made finetuning paradigm. To promote content-style disentanglement, we remove style descriptions from the text prompt and extract style information solely from the reference image using a decoupling learning strategy. Additionally, we design a scale-adaptive fusion module to balance the influence of text-based content features and image-based style features, which helps the model generalize across various text and style combinations. StyleCrafter efficiently generates high-quality stylized videos that align with the content of the text and resemble the style of the reference image. Experiments demonstrate that our approach is more flexible and efficient than existing competitors.
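The idea of balancing text-based content features against image-based style features with an adaptive scale can be illustrated with a toy sketch. The abstract does not specify how the fusion is computed, so everything below is an assumption made for illustration: the additive form `content + scale * style`, the sigmoid-based scale predictor, and the names `predict_scale` and `fuse` are hypothetical and are not taken from the StyleCrafter implementation.

```python
import math
from typing import List, Sequence


def predict_scale(
    content: Sequence[float],
    style: Sequence[float],
    w_content: float = 0.5,
    w_style: float = 0.5,
    bias: float = 0.0,
) -> float:
    """Hypothetical scale predictor (an assumption, not the paper's network).

    Maps the mean of each feature vector through a linear function and a
    sigmoid, so the returned scale always lies strictly between 0 and 1.
    """
    mean_c = sum(content) / len(content)
    mean_s = sum(style) / len(style)
    logit = w_content * mean_c + w_style * mean_s + bias
    return 1.0 / (1.0 + math.exp(-logit))


def fuse(content: Sequence[float], style: Sequence[float]) -> List[float]:
    """Additively fuse content and style features with an input-dependent scale.

    The scale is recomputed for every (content, style) pair, which is the
    "adaptive" part of this sketch: different prompts and reference styles
    receive different style strengths.
    """
    if len(content) != len(style):
        raise ValueError("content and style features must have the same length")
    scale = predict_scale(content, style)
    return [c + scale * s for c, s in zip(content, style)]


if __name__ == "__main__":
    text_features = [0.2, -0.1, 0.4]   # stand-in for text-based content features
    image_features = [1.0, 0.5, -0.5]  # stand-in for image-based style features
    print(fuse(text_features, image_features))
```

In a real model the scale would be produced by a learned network operating on high-dimensional features; the fixed weights here only make the dependence on both inputs visible.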