Current state-of-the-art methods for video inpainting typically rely on optical flow or attention-based approaches to inpaint masked regions by propagating visual information across frames. While such approaches have led to significant progress on standard benchmarks, they struggle when the task requires synthesizing novel content that is not present in other frames. In this paper, we reframe video inpainting as a conditional generative modeling problem and present a framework for solving such problems with conditional video diffusion models. We highlight the advantages of a generative approach for this task, showing that our method is capable of generating diverse, high-quality inpaintings and synthesizing new content that is spatially, temporally, and semantically consistent with the provided context.
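To make the conditional framing concrete, consider a minimal sketch under standard denoising-diffusion notation (the mask-and-context conditioning shown here is one common instantiation, introduced only for illustration). Given a video $x$ and a binary mask $m$, write the observed context as $x^{c} = m \odot x$ and the missing content as $x^{u} = (1 - m) \odot x$; a conditional denoiser $\epsilon_\theta$ can then be trained with the usual objective

$$
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
\Big[ \big\lVert \epsilon - \epsilon_\theta\big( \sqrt{\bar{\alpha}_t}\, x^{u} + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t,\; x^{c},\; m \big) \big\rVert_2^2 \Big],
$$

so that inpainting amounts to running the learned reverse process to sample the missing region conditioned on the observed context, rather than copying content from other frames.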