I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models either by concatenating the image channel-wise with the noised video frames before feeding them into the model, or by injecting the image embedding produced by a pretrained image encoder into the cross-attention modules. However, the former approach often necessitates altering the fundamental weights of the pretrained T2V model, which restricts the model's compatibility with the open-source community and disrupts its prior knowledge. Meanwhile, the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome these limitations. I2V-Adapter propagates the unnoised input image to the subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter introduces only a few trainable parameters, which significantly reduces the training cost and ensures compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior to balance the motion amplitude and the stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos. This performance, coupled with its agility and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications.
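The cross-frame attention idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the layer layout, so the code below assumes identity query/key/value projections, treats plain self-attention as a stand-in for the frozen T2V layer, and adds a hypothetical `adapter_scale` weight on the extra branch. The names `attend` and `cross_frame_attention` are illustrative only. It shows one plausible reading, in which tokens of every later frame query the keys and values of the clean first frame and the result is added to the unchanged base output.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out

def cross_frame_attention(frames, adapter_scale=1.0):
    """frames[0] holds the tokens of the clean input image; later frames are noised.

    Assumption: the base branch (self-attention within a frame) stands in for
    the frozen pretrained layer, and the adapter branch is simply added to it.
    """
    first = frames[0]
    result = [attend(first, first, first)]  # first frame: base branch only (assumption)
    for frame in frames[1:]:
        base = attend(frame, frame, frame)   # frozen branch, left unchanged
        extra = attend(frame, first, first)  # adapter branch: keys/values from frame 0
        result.append([[b + adapter_scale * e for b, e in zip(bt, et)]
                       for bt, et in zip(base, extra)])
    return result

if __name__ == "__main__":
    # Three frames, two tokens each, two channels per token (toy data).
    toy = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.5, 0.5], [0.2, 0.8]],
           [[0.9, 0.1], [0.3, 0.3]]]
    for frame in cross_frame_attention(toy):
        print(frame)
```

Setting `adapter_scale` to 0 in this sketch recovers the base self-attention output exactly, which mirrors the abstract's claim that the pretrained T2V model itself is left untouched. The Frame Similarity Prior is not sketched here because the abstract does not define its two coefficients.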

