Recent video inpainting methods have made remarkable progress by using explicit guidance, such as optical flow, to propagate pixels across frames. However, in some cases the masked content does not recur in any other frame, resulting in a deficiency. In such situations, the model can no longer borrow pixels from other frames, and its focus shifts toward solving the inverse problem. In this paper, we introduce a dual-modality-compatible inpainting framework called Deficiency-aware Masked Transformer (DMT), which offers three key advantages. First, we pretrain an image inpainting model, DMT_img, to serve as a prior for distilling the video model, DMT_vid, thereby benefiting the hallucination of deficiency cases. Second, the self-attention module selectively incorporates spatiotemporal tokens to accelerate inference and remove noise signals. Third, a simple yet effective Receptive Field Contextualizer is integrated into DMT, further improving performance. Extensive experiments on the YouTube-VOS and DAVIS datasets demonstrate that DMT_vid significantly outperforms previous solutions. The code and video demonstrations can be found at github.com/yeates/DMT.
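The idea of a self-attention module that selectively incorporates spatiotemporal tokens can be illustrated with a minimal sketch. The abstract does not state how tokens are selected, so everything below is an assumption made for illustration: the per-token `validity` score (for example, the fraction of unmasked pixels in a token's patch), the top-`keep` selection rule, the function names, and the use of identity projections in place of learned query, key, and value weights. It is not the DMT implementation.

```python
import math


def select_tokens(validity, keep):
    """Return the indices of the `keep` tokens with the highest validity
    score, in their original order. The scoring rule is hypothetical."""
    ranked = sorted(range(len(validity)), key=lambda i: validity[i], reverse=True)
    return sorted(ranked[:keep])


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_self_attention(tokens, validity, keep):
    """Every token acts as a query, but only the selected tokens act as
    keys and values. Identity projections are used for brevity."""
    dim = len(tokens[0])
    kept = select_tokens(validity, keep)
    out = []
    for q in tokens:
        # Scaled dot-product scores against the kept tokens only.
        scores = [
            sum(a * b for a, b in zip(q, tokens[j])) / math.sqrt(dim)
            for j in kept
        ]
        weights = softmax(scores)
        # Weighted sum of the kept tokens' values.
        out.append(
            [sum(w * tokens[j][d] for w, j in zip(weights, kept)) for d in range(dim)]
        )
    return out
```

With `n` tokens and `keep` selected tokens, the score matrix has `n * keep` entries instead of `n * n`, which is where the inference saving would come from in this sketch, and low-validity tokens never contribute to any output, which is one way to read "remove noise signals".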