Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to the insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free, plug-and-play method that elevates the performance of T2V using the superior capabilities of T2I. Unlike conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses the encapsulated T2V to enhance temporal consistency and then inverts the result to the noise distribution required by T2I. Spatial quality elevating then harnesses the inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We conduct experiments on an extensive set of prompts across combinations of various T2V and T2I models. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator.
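To make the decomposed sampling step concrete, the following is a minimal structural sketch in Python, not the released implementation. Every name in it (`toy_t2v_step`, `toy_invert`, `toy_t2i_step`, `elevated_sampling`) is hypothetical, and the three "models" are toy placeholders: the T2V stand-in averages neighbouring frames, the inversion stand-in is a no-op, and the T2I stand-in simply halves each frame's values. Only the control flow, temporal motion refining followed by spatial quality elevating within each step, reflects the description above.

```python
import random
from typing import Callable, List

Latent = List[List[float]]            # toy video latent: one feature vector per frame
Step = Callable[[Latent, int], Latent]


def toy_t2v_step(z: Latent, t: int) -> Latent:
    """Hypothetical T2V stand-in: average each frame with its temporal neighbours."""
    n = len(z)
    out = []
    for i in range(n):
        prev_f, next_f = z[max(i - 1, 0)], z[min(i + 1, n - 1)]
        out.append([(a + b + c) / 3.0 for a, b, c in zip(prev_f, z[i], next_f)])
    return out


def toy_invert(z: Latent, t: int) -> Latent:
    """Hypothetical inversion stand-in (a no-op copy).

    A real version would map the latent to the noise distribution T2I expects.
    """
    return [frame[:] for frame in z]


def toy_t2i_step(z: Latent, t: int) -> Latent:
    """Hypothetical inflated-T2I stand-in: halve every frame's values
    as a crude 'less noisy latent' prediction."""
    return [[0.5 * v for v in frame] for frame in z]


def elevated_sampling(
    z: Latent,
    num_steps: int,
    t2v_step: Step = toy_t2v_step,
    invert: Step = toy_invert,
    t2i_step: Step = toy_t2i_step,
) -> Latent:
    """Run num_steps sampling steps, each split into the two described stages."""
    for t in range(num_steps, 0, -1):
        z = invert(t2v_step(z, t), t)   # temporal motion refining, then inversion
        z = t2i_step(z, t)              # spatial quality elevating
    return z


# Usage: 8 frames with 4 features each, starting from seeded Gaussian noise.
rng = random.Random(0)
z_init = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
z_final = elevated_sampling(z_init, num_steps=4)
```

Passing the three stages as callables mirrors the plug-and-play claim: under this assumed interface, any T2V and T2I pair could in principle be swapped in without changing the loop.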