Video inpainting is the task of filling a desired region in a video in a visually convincing manner. It is a very challenging task due to the high dimensionality of the signal and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Nonetheless, diffusion models remain very expensive to train and to run at inference, which strongly restricts their application to video. We show that in the case of video inpainting, thanks to the highly self-similar nature of videos, the training of a diffusion model can be restricted to the video to inpaint and still produce very satisfying results. This leads us to adopt an internal learning approach, which also allows for a greatly reduced network size. We call our approach "Infusion": an internal learning algorithm for video inpainting through diffusion. Due to our frugal network, we are able to propose the first video inpainting approach based purely on diffusion. Other methods require supporting elements such as optical flow estimation, which limits their performance in the case of dynamic textures, for example. We introduce a new method for efficient training and inference of diffusion models in the context of internal learning. We split the diffusion process into different learning intervals, which greatly simplifies the learning steps. We show qualitative and quantitative results, demonstrating that our method reaches state-of-the-art performance, in particular in the case of dynamic backgrounds and textures.
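The abstract says the diffusion process is split into learning intervals but does not say how the intervals are chosen or how training moves between them. As a rough illustration only, the sketch below assumes contiguous, near-equal-width intervals over the diffusion timesteps and draws training timesteps from one interval at a time. The names `split_timesteps` and `sample_timestep`, and the equal-width choice, are hypothetical and not taken from the paper.

```python
import random


def split_timesteps(num_steps, num_intervals):
    """Partition timesteps 0..num_steps-1 into contiguous intervals.

    Assumption for illustration: intervals are as close to equal width
    as possible; the abstract does not state the actual partition.
    """
    if not 1 <= num_intervals <= num_steps:
        raise ValueError("num_intervals must be between 1 and num_steps")
    base, extra = divmod(num_steps, num_intervals)
    intervals = []
    start = 0
    for i in range(num_intervals):
        # The first `extra` intervals absorb the remainder, one step each.
        length = base + (1 if i < extra else 0)
        intervals.append(range(start, start + length))
        start += length
    return intervals


def sample_timestep(interval, rng=random):
    """Draw one timestep uniformly from a single interval."""
    return rng.choice(interval)


if __name__ == "__main__":
    intervals = split_timesteps(num_steps=1000, num_intervals=4)
    rng = random.Random(0)
    for interval in intervals:
        t = sample_timestep(interval, rng)
        print(f"interval [{interval.start}, {interval.stop}) -> sampled t = {t}")
```

Under this assumption, a training loop would restrict itself to one interval at a time, so the network only has to handle a narrow range of noise levels at each stage.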