Image completion is the task of filling in the missing region of a masked image with plausible content. However, existing image completion methods tend to fill the missing region with the surrounding texture instead of hallucinating a visual instance that suits the context of the scene. In this work, we propose a novel image completion model, dubbed ImComplete, that hallucinates a missing instance that harmonizes well with, and thus preserves, the original context. ImComplete first adopts a transformer architecture that considers the visible instances and the location of the missing region. Then, ImComplete completes the semantic segmentation masks within the missing region, providing pixel-level semantic and structural guidance. Finally, the image synthesis blocks generate photo-realistic content. We comprehensively evaluate the results in terms of visual quality (LPIPS and FID) and contextual preservation (CLIPscore and object detection accuracy) on the COCO-panoptic and Visual Genome datasets. Experimental results show the superiority of ImComplete on a variety of natural images.
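The three stages described above (instance reasoning, segmentation completion, image synthesis) can be illustrated as a data flow with a toy sketch. This is not the ImComplete implementation: every function name, the co-occurrence table, and the color palette are hypothetical stand-ins. In particular, the transformer is replaced by a simple co-occurrence vote that ignores the location of the missing region, the segmentation completion is a flat fill, and the synthesis step is a palette lookup.

```python
from collections import Counter
from typing import Dict, List, Tuple

MISSING = -1                      # marker for pixels inside the masked region
SegMap = List[List[int]]          # per-pixel class ids
RGB = Tuple[int, int, int]


def predict_missing_class(visible: List[int],
                          cooccurrence: Dict[int, Dict[int, float]]) -> int:
    """Stage 1 stand-in (assumption): vote for the class that co-occurs
    most strongly with the visible classes. The paper uses a transformer
    that also considers the location of the missing region."""
    scores: Counter = Counter()
    for cls in visible:
        for candidate, weight in cooccurrence.get(cls, {}).items():
            scores[candidate] += weight
    if not scores:
        raise ValueError("no candidate class for the missing region")
    # Ties resolve to the smallest class id, so the result is deterministic.
    return max(sorted(scores), key=lambda c: scores[c])


def complete_segmentation(seg: SegMap, cls: int) -> SegMap:
    """Stage 2 stand-in (assumption): fill every missing pixel with the
    predicted class. A real model would predict a shaped instance mask."""
    return [[cls if p == MISSING else p for p in row] for row in seg]


def synthesize(seg: SegMap, palette: Dict[int, RGB]) -> List[List[RGB]]:
    """Stage 3 stand-in (assumption): map each class id to a flat color
    instead of generating photo-realistic content."""
    return [[palette[p] for p in row] for row in seg]


def complete_image(seg: SegMap,
                   cooccurrence: Dict[int, Dict[int, float]],
                   palette: Dict[int, RGB]):
    """Run the three stand-in stages in order."""
    visible = sorted({p for row in seg for p in row if p != MISSING})
    cls = predict_missing_class(visible, cooccurrence)
    completed = complete_segmentation(seg, cls)
    return cls, completed, synthesize(completed, palette)


# Hypothetical example: class 1 = sky, 2 = grass, 3 = bird, 4 = cow.
SEG = [[1, 1, MISSING],
       [1, MISSING, MISSING],
       [2, 2, 2]]
COOCCURRENCE = {1: {3: 0.6, 4: 0.1}, 2: {4: 0.5, 3: 0.2}}
PALETTE = {1: (135, 206, 235), 2: (34, 139, 34), 3: (0, 0, 0), 4: (139, 69, 19)}

predicted, completed_seg, image = complete_image(SEG, COOCCURRENCE, PALETTE)
```

In this toy input the masked region is filled with the instance class that best fits the visible classes, rather than with a copy of the neighboring texture class, which is the behavior the abstract contrasts against existing methods.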