Diffusion models have made substantial progress in image-to-video generation. However, in this paper, we find that these models tend to generate videos with less motion than expected. We attribute this to an issue we call conditional image leakage, whereby image-to-video diffusion models (I2V-DMs) over-rely on the conditional image at large time steps. We address this challenge from both the inference and training perspectives. First, at inference time, we start the generation process from an earlier time step to avoid the unreliable large time steps of I2V-DMs, and we derive an initial noise distribution with optimal analytic expressions (Analytic-Init) by minimizing the KL divergence between it and the actual marginal distribution, thereby bridging the training-inference gap. Second, during training, we design a time-dependent noise distribution (TimeNoise) for the conditional image, applying higher noise levels at larger time steps to disrupt the condition and reduce the model's dependence on it. We validate these general strategies on various I2V-DMs using our collected open-domain image benchmark and the UCF101 dataset. Extensive results show that our methods outperform baselines, producing higher motion scores with lower errors while maintaining image alignment and temporal consistency, thereby yielding superior overall performance and enabling more accurate motion control. Project page: \url{https://cond-image-leak.github.io/}.
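As a hedged illustration of the Analytic-Init idea (the notation below, including the start step $M$ and the marginal $q_M(\mathbf{x}\mid\mathbf{c})$ of noisy latents given the conditional image, is ours for exposition and may differ from the paper's exact derivation): if the initial noise is restricted to an isotropic Gaussian, minimizing the forward KL divergence to the actual marginal reduces to the standard moment-matching result,
\[
(\boldsymbol{\mu}^{*}, \sigma^{*2})
= \arg\min_{\boldsymbol{\mu},\,\sigma^{2}}
\mathrm{KL}\!\left(q_M(\mathbf{x}\mid\mathbf{c}) \,\big\|\, \mathcal{N}(\boldsymbol{\mu}, \sigma^{2}\mathbf{I})\right)
\;\Longrightarrow\;
\boldsymbol{\mu}^{*} = \mathbb{E}_{q_M}[\mathbf{x}],
\quad
\sigma^{*2} = \frac{1}{d}\,\mathbb{E}_{q_M}\!\left[\|\mathbf{x}-\boldsymbol{\mu}^{*}\|^{2}\right],
\]
where $d$ is the data dimension. Sampling the initial noise from $\mathcal{N}(\boldsymbol{\mu}^{*}, \sigma^{*2}\mathbf{I})$ at the earlier start step $M$ is what narrows the gap between training and inference.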
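To make the TimeNoise idea concrete, the following is a minimal PyTorch sketch of time-dependent perturbation of the conditional image. The linear schedule, the peak scale \texttt{s\_max}, and all function names here are illustrative assumptions, not the paper's actual noise distribution, which may take a different form:
\begin{verbatim}
import torch

def timenoise_perturb(cond_image: torch.Tensor,
                      t: torch.Tensor,
                      t_max: float = 1000.0,
                      s_max: float = 0.5) -> torch.Tensor:
    # Perturb the conditional image with Gaussian noise whose level
    # grows monotonically with the diffusion time step t, so the model
    # cannot simply copy a clean condition at large time steps.
    # The linear schedule and s_max are illustrative assumptions.
    level = s_max * (t.float() / t_max)                      # (batch,)
    level = level.view(-1, *([1] * (cond_image.dim() - 1)))  # broadcast
    return cond_image + level * torch.randn_like(cond_image)

# Hypothetical use inside a training step:
#   noisy_cond = timenoise_perturb(cond_image, t)
#   loss = i2v_dm(x_t, t, noisy_cond)
\end{verbatim}
The key design choice is monotonicity: at small $t$ the condition stays nearly clean, preserving image alignment, while at large $t$ it is heavily corrupted, forcing the model to learn motion rather than leak the conditional image.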