Deep video inpainting is often used maliciously to remove important objects and create fake videos, so it is important to localize inpainted regions blindly, that is, without access to the original video. This letter proposes a simple yet effective forensic scheme for Video Inpainting LOcalization with ContrAstive Learning (ViLocal). Specifically, a 3D Uniformer encoder is applied to the video noise residual to learn effective spatiotemporal forensic features. To enhance their discriminative power, supervised contrastive learning is adopted to capture the local inconsistency of inpainted videos by attracting positive pixel pairs (two pristine or two forged pixels) and repelling negative pairs (one pristine and one forged pixel). A lightweight convolutional decoder, trained with a specialized two-stage strategy, then produces a pixel-wise inpainting localization map. To provide enough training samples, we build a video object segmentation dataset of 2500 videos with pixel-level annotations for every frame. Extensive experimental results validate the superiority of ViLocal over state-of-the-art methods. Code and dataset will be available at https://github.com/multimediaFor/ViLocal.
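To make the attract/repel idea concrete, the following is a minimal sketch of a generic supervised contrastive loss computed over a handful of sampled pixel embeddings. It is not taken from the ViLocal code: the function names `l2_normalize` and `supcon_pixel_loss`, the temperature value, and the use of plain Python lists are all assumptions made for illustration, and the loss actually used in the letter may differ in sampling and weighting.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def supcon_pixel_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over sampled pixel embeddings.

    features: list of per-pixel feature vectors (hypothetical encoder outputs)
    labels:   list of 0 (pristine) or 1 (inpainted), one per pixel
    Pixels sharing a label are treated as positives for each other;
    pixels with different labels act as negatives.
    """
    z = [l2_normalize(f) for f in features]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled cosine similarity to every other pixel.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of picking a positive partner.
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(supcon_pixel_loss(feats, [0, 0, 1, 1]))  # labels agree with clusters
    print(supcon_pixel_loss(feats, [0, 1, 0, 1]))  # labels cut across clusters
```

In this toy setting the loss is small when same-label pixels already lie close together in feature space and large when they do not, which is the behavior that pushes pristine and forged pixels apart during training.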