Diffusion models have made substantial progress in image-to-video generation. However, in this paper, we find that these models tend to generate videos with less motion than expected. We attribute this to an issue we call conditional image leakage, whereby image-to-video diffusion models (I2V-DMs) over-rely on the conditional image at large time steps. We address this challenge from both the inference and training perspectives. First, at inference time, we start the generation process from an earlier time step to avoid the unreliable large time steps of I2V-DMs, and we derive an initial noise distribution with optimal analytic expressions (Analytic-Init) by minimizing the KL divergence between it and the actual marginal distribution, thereby bridging the training-inference gap. Second, during training, we design a time-dependent noise distribution (TimeNoise) for the conditional image, applying higher noise levels at larger time steps to disrupt the condition and reduce the model's dependence on it. We validate these general strategies on various I2V-DMs using our collected open-domain image benchmark and the UCF101 dataset. Extensive results show that our methods outperform baselines, producing higher motion scores with lower errors while maintaining image alignment and temporal consistency, thereby yielding superior overall performance and enabling more accurate motion control. Project page: \url{https://cond-image-leak.github.io/}.
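As a hedged illustration of the Analytic-Init idea (the notation below, including the start step $M$ and the marginal $q_M(\mathbf{x}\mid\mathbf{c})$ of noisy latents given the conditional image, is ours for exposition and may differ from the paper's exact derivation): if the initial noise is restricted to an isotropic Gaussian, minimizing the forward KL divergence to the actual marginal reduces to the standard moment-matching result,
\[
(\boldsymbol{\mu}^{*}, \sigma^{*2})
= \arg\min_{\boldsymbol{\mu},\,\sigma^{2}}
\mathrm{KL}\!\left(q_M(\mathbf{x}\mid\mathbf{c}) \,\big\|\, \mathcal{N}(\boldsymbol{\mu}, \sigma^{2}\mathbf{I})\right)
\;\Longrightarrow\;
\boldsymbol{\mu}^{*} = \mathbb{E}_{q_M}[\mathbf{x}],
\quad
\sigma^{*2} = \frac{1}{d}\,\mathbb{E}_{q_M}\!\left[\|\mathbf{x}-\boldsymbol{\mu}^{*}\|^{2}\right],
\]
where $d$ is the data dimension. Sampling the initial noise from $\mathcal{N}(\boldsymbol{\mu}^{*}, \sigma^{*2}\mathbf{I})$ at the earlier start step $M$ is what narrows the gap between training and inference.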
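To make the TimeNoise idea concrete, the following is a minimal PyTorch sketch of time-dependent perturbation of the conditional image. The linear schedule, the peak scale \texttt{s\_max}, and all function names here are illustrative assumptions, not the paper's actual noise distribution, which may take a different form:
\begin{verbatim}
import torch

def timenoise_perturb(cond_image: torch.Tensor,
                      t: torch.Tensor,
                      t_max: float = 1000.0,
                      s_max: float = 0.5) -> torch.Tensor:
    # Perturb the conditional image with Gaussian noise whose level
    # grows monotonically with the diffusion time step t, so the model
    # cannot simply copy a clean condition at large time steps.
    # The linear schedule and s_max are illustrative assumptions.
    level = s_max * (t.float() / t_max)                      # (batch,)
    level = level.view(-1, *([1] * (cond_image.dim() - 1)))  # broadcast
    return cond_image + level * torch.randn_like(cond_image)

# Hypothetical use inside a training step:
#   noisy_cond = timenoise_perturb(cond_image, t)
#   loss = i2v_dm(x_t, t, noisy_cond)
\end{verbatim}
The key design choice is monotonicity: at small $t$ the condition stays nearly clean, preserving image alignment, while at large $t$ it is heavily corrupted, forcing the model to learn motion rather than leak the conditional image.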