DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

Image-to-video generation, which aims to generate a video from a given reference image, has drawn considerable attention. Existing methods typically extend pre-trained text-guided image diffusion models into image-guided video generation models. However, because they rely on shallow image guidance and have poor temporal consistency, these methods often produce videos with low fidelity or flickering over time. To tackle these problems, we propose DreamVideo, a high-fidelity image-to-video generation method that adds a frame retention branch to a pre-trained video diffusion model. Instead of integrating the reference image into the diffusion process at a semantic level, DreamVideo perceives the reference image through convolution layers and concatenates the resulting features with the noisy latents as model input. In this way, the details of the reference image are preserved to the greatest possible extent. In addition, by incorporating double-condition classifier-free guidance, a single image can be animated into videos of different actions by varying the text prompt, which is significant for controllable video generation and has broad application prospects. We conduct comprehensive experiments on a public dataset, and both quantitative and qualitative results indicate that our method outperforms the state of the art. In particular, our model has a strong image retention ability and, to the best of our knowledge, delivers the best fidelity results on UCF101 among image-to-video models. Precise control can also be achieved by providing different text prompts. Further details and comprehensive results are provided at https://anonymous0769.github.io/DreamVideo/.
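
To make the idea of double-condition classifier-free guidance more concrete, the following is a minimal sketch in plain Python. The abstract does not give the exact guidance formula, so this uses one commonly used two-condition form as an assumption: the unconditional noise estimate is pushed first toward the image-conditioned estimate and then toward the image-and-text-conditioned estimate, each with its own scale. The function names, the denoiser signature, the scale parameters, and the toy denoiser are all hypothetical and are not taken from DreamVideo.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical denoiser signature (an assumption, not the paper's API):
# (noisy_latent, image_condition, text_condition) -> predicted noise.
Denoiser = Callable[[Vector, Optional[Vector], Optional[str]], Vector]


def double_condition_cfg(
    denoise: Denoiser,
    latent: Vector,
    image: Vector,
    text: str,
    image_scale: float,
    text_scale: float,
) -> Vector:
    """Combine three noise estimates with separate image and text scales.

    Assumed formulation (the paper's exact form may differ):
        eps = eps(none, none)
            + image_scale * (eps(image, none) - eps(none, none))
            + text_scale  * (eps(image, text) - eps(image, none))
    """
    eps_uncond = denoise(latent, None, None)   # no conditions
    eps_image = denoise(latent, image, None)   # image condition only
    eps_full = denoise(latent, image, text)    # image and text conditions
    return [
        u + image_scale * (i - u) + text_scale * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


def toy_denoiser(
    latent: Vector, image: Optional[Vector], text: Optional[str]
) -> Vector:
    """Stand-in for a trained network: each present condition adds a fixed offset."""
    image_term = 0.0 if image is None else 1.0
    text_term = 0.0 if text is None else 0.5
    return [0.1 * x + image_term + text_term for x in latent]


guided = double_condition_cfg(
    toy_denoiser, [0.0, 10.0], [1.0], "a dog running",
    image_scale=2.0, text_scale=4.0,
)
print(guided)
```

With both scales set to 1.0 the combination reduces to the fully conditioned estimate. Raising the text scale alone strengthens the influence of the prompt without changing how strongly the reference image is enforced, which is one way to read the claim that a single image can be steered toward different actions by changing the text.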
