Recent work in Video Frame Interpolation (VFI) formulates the task as diffusion-based conditional image generation: the intermediate frame is synthesized from random noise and the neighboring frames. Because videos have relatively high resolution, Latent Diffusion Models (LDMs) serve as the conditional generation model; their autoencoder compresses images into latent representations for diffusion and then reconstructs images from those representations. This formulation poses a crucial challenge: VFI expects the output to deterministically equal the ground-truth intermediate frame, whereas LDMs generate a diverse set of images when run multiple times. The diversity arises because the cumulative variance (the variance accumulated over the steps of generation) of the generated latent representations in LDMs is large, which makes the sampling trajectory random and yields diverse rather than deterministic generations. To address this problem, we propose Frame Interpolation with Consecutive Brownian Bridge Diffusion. Specifically, we propose a consecutive Brownian Bridge diffusion that takes a deterministic initial value as input, resulting in a much smaller cumulative variance of the generated latent representations. Our experiments suggest that our method improves as the autoencoder improves and achieves state-of-the-art performance in VFI, leaving strong potential for further enhancement.
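The variance argument can be illustrated with a textbook one-dimensional Brownian bridge. The sketch below is not the consecutive Brownian Bridge diffusion proposed here, and it does not use the paper's definition of cumulative variance; it only compares an unpinned Brownian motion, whose variance grows to T, with a bridge pinned at both ends, whose marginal variance t(T - t)/T never exceeds T/4 and vanishes at the endpoints. The names `bridge_variance`, `simulate_paths`, and `to_bridge`, and the scalar endpoints `a` and `b` (stand-ins for a deterministic starting value and a target value), are hypothetical and chosen for illustration.

```python
import random
import statistics


def bridge_variance(t: float, T: float = 1.0) -> float:
    """Analytic marginal variance of a standard Brownian bridge pinned at times 0 and T."""
    return t * (T - t) / T


def simulate_paths(n_paths: int, n_steps: int, T: float = 1.0, seed: int = 0) -> list:
    """Simulate unpinned Brownian motion paths starting at 0 on a uniform time grid."""
    rng = random.Random(seed)
    dt = T / n_steps
    paths = []
    for _ in range(n_paths):
        w = [0.0]
        for _ in range(n_steps):
            # Independent Gaussian increment with variance dt.
            w.append(w[-1] + rng.gauss(0.0, dt ** 0.5))
        paths.append(w)
    return paths


def to_bridge(w: list, a: float, b: float) -> list:
    """Pin a Brownian path so that it starts at a and ends exactly at b.

    Uses the standard construction X_t = a + (t/T)(b - a) + W_t - (t/T) W_T.
    The endpoints a and b are illustrative scalars, not latents from the paper.
    """
    n = len(w) - 1
    return [a + (k / n) * (b - a) + w[k] - (k / n) * w[n] for k in range(n + 1)]


paths = simulate_paths(n_paths=2000, n_steps=20)
bridges = [to_bridge(w, a=0.0, b=1.0) for w in paths]

mid = 10  # grid index of t = T/2
free_var_end = statistics.pvariance([w[-1] for w in paths])      # close to T
bridge_var_mid = statistics.pvariance([x[mid] for x in bridges])  # close to T/4
bridge_var_end = statistics.pvariance([x[-1] for x in bridges])   # essentially zero

print("unpinned variance at t = T:  ", free_var_end)
print("bridge variance at t = T/2:  ", bridge_var_mid)
print("bridge variance at t = T:    ", bridge_var_end)
```

In this toy setting every bridge sample path ends at the same point regardless of the noise drawn, which is the intuition behind preferring a process with deterministic endpoints when a single ground-truth output is expected.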