MaGIC: Multi-modality Guided Image Completion

Vanilla image completion approaches are sensitive to large missing regions, because only limited reference information is available for plausible generation. To mitigate this, existing methods incorporate an extra cue as guidance for image completion. Despite improvements, these approaches are often restricted to a single modality (e.g., segmentation or sketch maps) and therefore scale poorly when multiple modalities are to be leveraged for more plausible completion. In this paper, we propose a novel, simple yet effective method for Multi-modal Guided Image Completion, dubbed MaGIC, which not only supports a wide range of single modalities as guidance (e.g., text, Canny edge, sketch, segmentation, reference image, depth, and pose), but also adapts to arbitrarily customized combinations of these modalities (i.e., arbitrary multi-modality) for image completion. To build MaGIC, we first introduce a modality-specific conditional U-Net (MCU-Net) that injects a single-modal signal into a U-Net denoiser for single-modal guided image completion. Then, we devise a consistent modality blending (CMB) method that leverages the modality signals encoded in multiple learned MCU-Nets through gradient guidance in latent space. CMB is training-free and hence avoids the cumbersome joint re-training of different modalities, which is the key to MaGIC's exceptional flexibility in accommodating new modalities for completion. Experiments show the superiority of MaGIC over state-of-the-art methods and its generalization to various completion tasks, including in-painting, out-painting, and local editing. Our project with code and models is available at yeates.github.io/MaGIC-Page/.
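
To make the idea of training-free gradient guidance in latent space concrete, the following is a minimal toy sketch, not the MaGIC implementation. It assumes a fixed linear map as a stand-in for the denoiser's feature extractor and one target feature vector per modality as a stand-in for the signals produced by separately trained MCU-Nets; the names `guidance_loss`, `blend_guidance_step`, `feature_map`, `targets`, and `weights` are all hypothetical. The latent is nudged along the negative gradient of a weighted sum of per-modality feature distances, and no parameters are trained.

```python
import numpy as np


def guidance_loss(z, feature_map, targets, weights):
    """Weighted sum of squared distances between the latent's features
    and each modality's target features (toy stand-in, an assumption)."""
    feats = feature_map @ z
    return sum(w * float(np.sum((feats - t) ** 2)) for t, w in zip(targets, weights))


def blend_guidance_step(z, feature_map, targets, weights, step_size):
    """One gradient-guidance update of the latent z.

    Only z changes; feature_map and targets stay fixed, which mirrors
    the training-free setting in this simplified picture.
    """
    feats = feature_map @ z
    grad = np.zeros_like(z)
    for target, w in zip(targets, weights):
        # d/dz of w * ||A z - target||^2 is 2 * w * A^T (A z - target)
        grad += 2.0 * w * (feature_map.T @ (feats - target))
    return z - step_size * grad


if __name__ == "__main__":
    # Hypothetical 2-D latent with 3-D features and two "modalities".
    feature_map = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    targets = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
    weights = [0.5, 0.5]

    z = np.zeros(2)
    print("initial loss:", guidance_loss(z, feature_map, targets, weights))
    for _ in range(200):
        z = blend_guidance_step(z, feature_map, targets, weights, step_size=0.1)
    print("final latent:", z)
    print("final loss:", guidance_loss(z, feature_map, targets, weights))
```

In this toy setting, adding a modality only means appending another entry to `targets` and `weights`; nothing is re-trained, which is the property the abstract attributes to CMB.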
