Text-to-video generation has lagged behind text-to-image synthesis in quality and diversity, owing to the complexity of spatio-temporal modeling and the scarcity of video-text datasets. This paper presents I4VGen, a training-free, plug-and-play video diffusion inference framework that enhances text-to-video generation by leveraging robust image-generation techniques. Specifically, following a text-to-image-to-video paradigm, I4VGen decomposes text-to-video generation into two stages: anchor image synthesis and anchor image-guided video synthesis. Correspondingly, a well-designed generation-selection pipeline produces a visually realistic and semantically faithful anchor image, and an innovative Noise-Invariant Video Score Distillation Sampling (NI-VSDS) animates the image into a dynamic video, followed by a video regeneration process that refines the result. This inference strategy effectively mitigates the prevalent issue of non-zero terminal signal-to-noise ratio. Extensive evaluations show that I4VGen not only produces videos with higher visual realism and textual fidelity but also integrates seamlessly into existing image-to-video diffusion models, thereby improving overall video quality.
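To make the two-stage decomposition concrete, below is a minimal Python sketch of the inference flow the abstract describes. All callable names (`generate_image`, `score_image`, `animate`, `regenerate`) and the candidate count are illustrative assumptions, not the paper's actual interface; the real method operates inside a pretrained video diffusion model.

```python
from typing import Any, Callable, List

def i4vgen_inference(
    prompt: str,
    generate_image: Callable[[str], Any],      # text-to-image sampler (assumed)
    score_image: Callable[[Any, str], float],  # image-text alignment score, e.g. CLIP-style (assumed)
    animate: Callable[[Any, str], Any],        # NI-VSDS step: anchor image -> coarse video (assumed)
    regenerate: Callable[[Any, str], Any],     # diffusion-based video regeneration (assumed)
    num_candidates: int = 4,                   # illustrative candidate count
) -> Any:
    """Two-stage text-to-video inference following the abstract's decomposition."""
    # Stage 1: anchor image synthesis via generation-selection.
    # Sample several candidates with an off-the-shelf text-to-image model,
    # then keep the one that scores best against the prompt.
    candidates: List[Any] = [generate_image(prompt) for _ in range(num_candidates)]
    anchor = max(candidates, key=lambda img: score_image(img, prompt))

    # Stage 2: anchor image-guided video synthesis.
    # (a) Animate the static anchor into a dynamic video with NI-VSDS.
    coarse_video = animate(anchor, prompt)

    # (b) Regenerate: re-noise the coarse video and run the video diffusion
    # model's standard sampling to refine it, so sampling never starts from
    # pure noise (mitigating the non-zero terminal SNR issue).
    return regenerate(coarse_video, prompt)
```

In this sketch, the generation-selection stage and the animate-then-regenerate stage are kept as separate steps, mirroring the training-free, plug-and-play design: each callable can be backed by an existing pretrained model without modifying its weights.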