Image completion aims to fill in the missing region of a masked image with plausible content. However, existing image completion methods tend to fill the missing region with the surrounding texture instead of hallucinating a visual instance that suits the context of the scene. In this work, we propose a novel image completion model, dubbed ImComplete, that hallucinates a missing instance that harmonizes with, and thus preserves, the original context. ImComplete first adopts a transformer architecture that considers the visible instances and the location of the missing region. Then, ImComplete completes the semantic segmentation masks within the missing region, providing pixel-level semantic and structural guidance. Finally, the image synthesis blocks generate photo-realistic content. We comprehensively evaluate the results in terms of visual quality (LPIPS and FID) and contextual preservation (CLIPscore and object detection accuracy) on the COCO-panoptic and Visual Genome datasets. Experimental results show the superiority of ImComplete on various natural images.
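The three stages described above can be read as a simple data flow. The sketch below is only an illustration of that flow, not the ImComplete implementation: every function name is hypothetical, the transformer is replaced by a trivial most-frequent-class rule, the synthesis blocks are replaced by a palette lookup, and the assumption that the first stage outputs a class label for the missing instance is ours rather than something the abstract states.

```python
from collections import Counter


def predict_missing_class(visible_classes, mask_box):
    """Stage 1 stand-in (hypothetical). The paper uses a transformer over the
    visible instances and the mask location; this placeholder just returns the
    most frequent visible class and ignores mask_box."""
    return Counter(visible_classes).most_common(1)[0][0]


def complete_segmentation(seg_map, mask_box, class_name):
    """Stage 2 stand-in (hypothetical). Labels every pixel inside the masked
    box with the predicted class and returns a new segmentation map."""
    top, left, bottom, right = mask_box
    completed = [row[:] for row in seg_map]
    for y in range(top, bottom):
        for x in range(left, right):
            completed[y][x] = class_name
    return completed


def synthesize_image(image, seg_map, mask_box, palette):
    """Stage 3 stand-in (hypothetical). Paints masked pixels with a fixed
    per-class value instead of running learned synthesis blocks."""
    top, left, bottom, right = mask_box
    output = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            output[y][x] = palette[seg_map[y][x]]
    return output


def complete_image(image, seg_map, visible_classes, mask_box, palette):
    """Chains the three stages in the order the abstract describes."""
    class_name = predict_missing_class(visible_classes, mask_box)
    completed_seg = complete_segmentation(seg_map, mask_box, class_name)
    completed_img = synthesize_image(image, completed_seg, mask_box, palette)
    return class_name, completed_seg, completed_img
```

The point of the sketch is the ordering: the segmentation map is completed before any pixels are generated, so the synthesis step receives per-pixel semantic guidance for the masked region.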