In this paper, we present VideoGen, a text-to-video generation approach that uses reference-guided latent diffusion to generate high-definition videos with high frame fidelity and strong temporal consistency. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate a high-quality image from the text prompt and use it as a reference image to guide video generation. We then introduce an efficient cascaded latent diffusion module, conditioned on both the reference image and the text prompt, that generates latent video representations, followed by a flow-based temporal upsampling step that improves the temporal resolution. Finally, an enhanced video decoder maps the latent video representations to a high-definition video. During training, we use the first frame of a ground-truth video as the reference image for the cascaded latent diffusion module. The main characteristics of our approach are: (1) the reference image generated by the text-to-image model improves visual fidelity; (2) using it as the condition lets the diffusion model focus more on learning video dynamics; and (3) the video decoder is trained on unlabeled video data and thus benefits from high-quality, easily available videos. VideoGen sets a new state of the art in text-to-video generation in both qualitative and quantitative evaluation.
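
The data flow described in the abstract can be illustrated with a shape-only sketch. This is not the authors' implementation: no model is run, and every function name, resolution, frame count, latent size, and upsampling factor below is a hypothetical placeholder chosen only to show how the stages connect at inference time and how the reference image is obtained at training time.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Tensor:
    """Shape-only placeholder; it carries no data and runs no model."""
    name: str
    shape: Tuple[int, ...]


def text_to_image(prompt: str, size: int = 512) -> Tensor:
    # Stand-in for an off-the-shelf text-to-image model (e.g., Stable Diffusion).
    # The 512x512 RGB output size is an assumption.
    return Tensor("reference_image", (3, size, size))


def first_frame(video: Tensor) -> Tensor:
    # Training-time reference: the first frame of a ground-truth video,
    # with the video laid out as (frames, channels, height, width).
    _, c, h, w = video.shape
    return Tensor("reference_image", (c, h, w))


def cascaded_latent_diffusion(reference: Tensor, prompt: str,
                              frames: int = 16, latent_channels: int = 4,
                              downscale: int = 8) -> Tensor:
    # Stand-in for the diffusion module conditioned on the reference image
    # and the text prompt. Frame count and latent layout are assumptions.
    _, h, w = reference.shape
    return Tensor("latent_video",
                  (frames, latent_channels, h // downscale, w // downscale))


def temporal_upsample(latents: Tensor, factor: int = 2) -> Tensor:
    # Stand-in for flow-based temporal upsampling. Assumes (factor - 1) new
    # frames are inserted between each pair of neighbouring frames.
    t, c, h, w = latents.shape
    return Tensor("latent_video_upsampled", ((t - 1) * factor + 1, c, h, w))


def decode_video(latents: Tensor, upscale: int = 8) -> Tensor:
    # Stand-in for the enhanced video decoder mapping latents to RGB frames.
    t, _, h, w = latents.shape
    return Tensor("video", (t, 3, h * upscale, w * upscale))


def generate(prompt: str) -> Tensor:
    # Inference path: text -> reference image -> latents -> upsample -> decode.
    reference = text_to_image(prompt)
    latents = cascaded_latent_diffusion(reference, prompt)
    latents = temporal_upsample(latents)
    return decode_video(latents)
```

With these placeholder settings, `generate("a cat playing piano")` describes a video of shape `(31, 3, 512, 512)`. At training time, the reference passed to `cascaded_latent_diffusion` would come from `first_frame(ground_truth_video)` rather than from `text_to_image`.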