Text-guided image-to-video (I2V) generation aims to produce a coherent video that preserves the identity of the input image and aligns semantically with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models either by concatenating the image with noised video frames channel-wise before feeding them into the model, or by injecting image embeddings produced by pretrained image encoders into cross-attention modules. However, the former approach often requires altering the original weights of the pretrained T2V model, which limits compatibility with models from the open-source community and disrupts the model's prior knowledge, while the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome these limitations. I2V-Adapter propagates the unnoised input image to the subsequent noised frames through a cross-frame attention mechanism, preserving the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter introduces only a small number of trainable parameters, which significantly reduces the training cost while ensuring compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior that balances the motion amplitude and the stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter produces high-quality videos. This performance, together with its efficiency and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications.
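The cross-frame attention idea described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation; the module name, shapes, and zero-initialized output projection are assumptions chosen to show the general pattern: each noised frame queries the keys and values of the first, unnoised frame, and the adapter is added residually so the pretrained T2V weights are left untouched.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Hypothetical sketch of an I2V-Adapter-style cross-frame attention
    branch: noised frames attend to the clean first frame to propagate
    its identity, without modifying the frozen T2V backbone."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Zero-initialized output projection (an assumption, a common
        # adapter trick): the branch starts as a no-op and is learned.
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, tokens, dim); frame 0 is the
        # unnoised input image's latent tokens.
        b, f, t, d = frames.shape
        first = frames[:, 0]  # keys/values come from the clean frame
        q = frames.reshape(b * f, t, d)
        kv = first.unsqueeze(1).expand(b, f, t, d).reshape(b * f, t, d)
        attended, _ = self.attn(q, kv, kv)
        # Residual add keeps the pretrained pathway intact.
        return frames + self.out(attended).reshape(b, f, t, d)
```

Because the output projection starts at zero, the adapter initially returns its input unchanged, so training can begin from the pretrained model's behavior and only the few adapter parameters need to be learned.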