Multi-turn compositional image generation (M-CIG) is a challenging task that aims to iteratively manipulate a reference image given a modification text. Most existing methods for M-CIG are based on generative adversarial networks (GANs), although recent advances in image generation have demonstrated the superiority of diffusion models over GANs. In this paper, we propose a diffusion-based method for M-CIG named conditional denoising diffusion with image compositional matching (CDD-ICM). We use CLIP as the backbone of the image and text encoders, and incorporate a gated fusion mechanism, originally proposed for question answering, to compositionally fuse the reference image and the modification text at each turn of M-CIG. We introduce a conditioning scheme that generates the target image from the fusion results. To prioritize the semantic quality of the generated target image, we learn an auxiliary image compositional match (ICM) objective together with the conditional denoising diffusion (CDD) objective in a multi-task learning framework. We also apply ICM guidance and classifier-free guidance to improve performance. Experimental results show that CDD-ICM achieves state-of-the-art results on two benchmark datasets for M-CIG, namely CoDraw and i-CLEVR.
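To make the three ingredients named above more concrete, the following is a minimal sketch in plain Python, not the implementation of CDD-ICM. It assumes a simplified per-dimension gate for the fusion step, a weighted sum for the multi-task loss, and the standard classifier-free guidance combination of conditional and unconditional noise predictions. The function names, the gate parameterization, and the `icm_weight` parameter are hypothetical and do not come from the abstract.

```python
import math


def sigmoid(x):
    """Logistic function used to squash gate logits into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(image_feat, text_feat, w_image, w_text, bias):
    """Fuse two feature vectors with a learned per-dimension gate.

    Assumption: a simplified gate g_i = sigmoid(w_image_i * image_i
    + w_text_i * text_i + bias_i); the fused feature is the convex
    combination g_i * image_i + (1 - g_i) * text_i. The gate used in
    the paper may be parameterized differently.
    """
    fused = []
    for img, txt, wi, wt, b in zip(image_feat, text_feat, w_image, w_text, bias):
        gate = sigmoid(wi * img + wt * txt + b)
        fused.append(gate * img + (1.0 - gate) * txt)
    return fused


def multitask_loss(cdd_loss, icm_loss, icm_weight=1.0):
    """Combine the denoising loss and the auxiliary matching loss.

    Assumption: a simple weighted sum; icm_weight is a hypothetical
    hyperparameter, not a value reported in the abstract.
    """
    return cdd_loss + icm_weight * icm_loss


def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Standard classifier-free guidance on noise predictions.

    Extrapolates from the unconditional prediction toward the
    conditional one: eps = eps_uncond + s * (eps_cond - eps_uncond).
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

In this sketch a guidance scale of 1 recovers the conditional prediction and a scale of 0 recovers the unconditional one; larger scales push the sample further toward the condition, here the fused image and text representation.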