Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models and discover an implicit training-inference gap that contributes to the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial noise at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves the temporal consistency of videos generated by diffusion models. By iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to bridge the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
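The core operation described above, replacing the spatial-temporal low-frequency band of one noise latent with that of another while keeping the high-frequency band fresh, can be sketched with a 3D FFT and a Gaussian low-pass mask. This is a minimal illustrative sketch, not the paper's implementation: the function names, the filter parameterization, and the cutoff `d0` are assumptions introduced here for clarity.

```python
import numpy as np

def gaussian_low_pass_filter(shape, d0=0.25):
    # Spatial-temporal Gaussian low-pass mask in the frequency domain.
    # `d0` is a hypothetical normalized cutoff; the paper's exact
    # filter settings may differ.
    T, H, W = shape
    t = np.fft.fftfreq(T)[:, None, None]
    h = np.fft.fftfreq(H)[None, :, None]
    w = np.fft.fftfreq(W)[None, None, :]
    d2 = t**2 + h**2 + w**2
    return np.exp(-d2 / (2 * d0**2))

def reinit_noise(z_low_src, fresh_noise, d0=0.25):
    # One noise-reinitialization step: keep the low-frequency band of
    # `z_low_src` (e.g. the re-noised latent from the previous sampling
    # round) and take the high-frequency band from freshly sampled
    # Gaussian noise, then return to the spatial-temporal domain.
    lp = gaussian_low_pass_filter(z_low_src.shape, d0)
    z_freq = np.fft.fftn(z_low_src)
    n_freq = np.fft.fftn(fresh_noise)
    mixed = z_freq * lp + n_freq * (1.0 - lp)
    return np.fft.ifftn(mixed).real
```

In an iterative sampling loop, each round would denoise from the mixed latent, re-noise the result via the forward diffusion process, and feed it back as `z_low_src` for the next round, so the low-frequency content is progressively refined while the high frequencies stay properly Gaussian.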