Removing objects from videos and filling in the erased regions with deep video inpainting (VI) algorithms has recently attracted considerable attention. The task usually requires a video sequence and object segmentation masks for all frames as input. In real-world applications, however, providing segmentation masks for every frame is difficult and inefficient. We therefore address VI in a one-shot manner, taking only the initial frame's object mask as input. Although this can be achieved by naively combining video object segmentation (VOS) and VI methods, such combinations are sub-optimal and often cause critical errors. To address this, we propose a unified pipeline for one-shot video inpainting (OSVI). Because mask prediction and video completion are learned jointly in an end-to-end manner, the pipeline is optimized for the entire task rather than for each module separately. Additionally, unlike two-stage methods that treat the predicted masks as ground-truth cues, our method is more reliable because the predicted masks serve as internal guidance for the network. On synthesized datasets for OSVI, the proposed method outperforms all other methods both quantitatively and qualitatively.
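To make the idea of joint learning concrete, the following is a minimal sketch of what a single training objective covering both mask prediction and video completion could look like. It is an illustration under stated assumptions, not the objective used in this work: the function names, the choice of binary cross-entropy and L1 terms, and the `mask_weight` parameter are all hypothetical.

```python
import numpy as np


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy between soft masks in [0, 1] and binary targets."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))


def l1_error(pred, target):
    """Mean absolute error between completed frames and reference frames."""
    return float(np.mean(np.abs(pred - target)))


def joint_loss(pred_masks, gt_masks, pred_frames, gt_frames, mask_weight=1.0):
    """Hypothetical joint objective (an assumption, not the paper's loss).

    pred_masks, gt_masks:   arrays of shape (T, H, W)
    pred_frames, gt_frames: arrays of shape (T, H, W, 3)

    Both terms are summed into one scalar, so a single optimizer step
    would update the mask-prediction and completion parts together.
    """
    mask_term = binary_cross_entropy(pred_masks, gt_masks)
    completion_term = l1_error(pred_frames, gt_frames)
    return mask_weight * mask_term + completion_term
```

In a naive two-stage combination, by contrast, each term would be minimized by a separately trained model, and the completion model would consume the predicted masks as if they were exact.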