In this paper, we present VideoGen, a text-to-video generation approach that uses reference-guided latent diffusion to generate high-definition videos with high frame fidelity and strong temporal consistency. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, and use it as a reference image to guide video generation. We then introduce an efficient cascaded latent diffusion module, conditioned on both the reference image and the text prompt, that generates latent video representations; a flow-based temporal upsampling step follows to improve the temporal resolution. Finally, an enhanced video decoder maps the latent video representations to a high-definition video. During training, the first frame of a ground-truth video serves as the reference image for the cascaded latent diffusion module. The main characteristics of our approach are: the reference image generated by the text-to-image model improves visual fidelity; conditioning on the reference image lets the diffusion model focus on learning video dynamics; and the video decoder is trained on unlabeled video data, and thus benefits from high-quality, easily available videos. VideoGen sets a new state of the art in text-to-video generation in both qualitative and quantitative evaluation. See \url{https://videogen.github.io/VideoGen/} for more samples.
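
As a rough illustration of how the four inference stages described above compose, the following Python sketch wires together placeholder functions. Every function name, array shape, and stage body here is an assumption made for illustration and is not taken from the VideoGen implementation: the text-to-image model is replaced by random pixels, the cascaded latent diffusion module by average pooling plus noise, the flow-based temporal upsampling by a plain linear blend of neighbouring frames, and the video decoder by nearest-neighbour upsampling. Only the order of the stages and their conditioning inputs follow the description.

```python
import numpy as np


def generate_reference_image(prompt: str, rng: np.random.Generator) -> np.ndarray:
    # Placeholder for an off-the-shelf text-to-image model such as Stable Diffusion.
    # Returns a random (H, W, 3) image; the prompt is ignored in this stand-in.
    return rng.random((64, 64, 3))


def latent_video_diffusion(reference: np.ndarray, prompt: str, num_frames: int,
                           rng: np.random.Generator) -> np.ndarray:
    # Placeholder for the cascaded latent diffusion module, which is conditioned
    # on both the reference image and the prompt. Here the reference is average
    # pooled to an assumed 8x8 latent grid and each frame adds a little noise.
    h, w, c = reference.shape
    pooled = reference.reshape(8, h // 8, 8, w // 8, c).mean(axis=(1, 3))
    noise = 0.05 * rng.standard_normal((num_frames, 8, 8, c))
    return pooled[None] + noise  # shape (T, 8, 8, 3)


def temporal_upsample(latents: np.ndarray) -> np.ndarray:
    # Placeholder for flow-based temporal upsampling: inserts the midpoint of
    # each pair of neighbouring frames (a linear blend, with no optical flow).
    mids = 0.5 * (latents[:-1] + latents[1:])
    out = np.empty((2 * len(latents) - 1, *latents.shape[1:]), dtype=latents.dtype)
    out[0::2] = latents
    out[1::2] = mids
    return out


def decode_video(latents: np.ndarray, scale: int = 8) -> np.ndarray:
    # Placeholder for the enhanced video decoder: nearest-neighbour upsampling
    # of every latent frame back to pixel resolution.
    return latents.repeat(scale, axis=1).repeat(scale, axis=2)


def generate_video(prompt: str, num_frames: int = 4, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    reference = generate_reference_image(prompt, rng)                     # stage 1
    latents = latent_video_diffusion(reference, prompt, num_frames, rng)  # stage 2
    latents = temporal_upsample(latents)                                  # stage 3
    return decode_video(latents)                                          # stage 4


if __name__ == "__main__":
    video = generate_video("a corgi running on the beach")
    print(video.shape)
```

In this sketch the reference image enters only the second stage, mirroring the description in which the diffusion module, rather than the decoder, is conditioned on the reference image and the text prompt.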