Multi-turn compositional image generation (M-CIG) is a challenging task that aims to iteratively manipulate a reference image given a modification text. While most existing methods for M-CIG are based on generative adversarial networks (GANs), recent advances in image generation have demonstrated the superiority of diffusion models over GANs. In this paper, we propose a diffusion-based method for M-CIG named conditional denoising diffusion with image compositional matching (CDD-ICM). We leverage CLIP as the backbone of the image and text encoders, and incorporate a gated fusion mechanism, originally proposed for question answering, to compositionally fuse the reference image and the modification text at each turn of M-CIG. We introduce a conditioning scheme to generate the target image based on the fusion results. To prioritize the semantic quality of the generated target image, we learn an auxiliary image compositional matching (ICM) objective along with the conditional denoising diffusion (CDD) objective in a multi-task learning framework. We also perform ICM guidance and classifier-free guidance to improve performance. Experimental results show that CDD-ICM achieves state-of-the-art results on two benchmark datasets for M-CIG, namely CoDraw and i-CLEVR.
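The three ingredients named in the abstract (gated fusion of image and text features, a multi-task objective, and classifier-free guidance) can be illustrated with a minimal, framework-free sketch. This is not the implementation used in CDD-ICM: the gate below is one common formulation of gated fusion, and the function names, the parameter dictionary, the weighted-sum form of the joint loss, and `icm_weight` are all assumptions introduced for illustration. The guidance rule uses the widely used parameterization in which a scale of 1 recovers the conditional prediction.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    # Plain matrix-vector product plus bias; weights is a list of rows.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gated_fusion(image_feat, text_feat, params):
    # Assumed generic gate (not necessarily the paper's exact form):
    # a sigmoid gate decides, per dimension, how much of a text-conditioned
    # candidate replaces the reference-image feature.
    joint = image_feat + text_feat  # list concatenation
    gate = [sigmoid(z) for z in linear(params["w_gate"], params["b_gate"], joint)]
    cand = [math.tanh(z) for z in linear(params["w_cand"], params["b_cand"], joint)]
    return [g * c + (1.0 - g) * x for g, c, x in zip(gate, cand, image_feat)]


def multitask_loss(cdd_loss, icm_loss, icm_weight=1.0):
    # Hypothetical weighted sum of the denoising (CDD) and matching (ICM)
    # losses; icm_weight is an illustrative hyperparameter.
    return cdd_loss + icm_weight * icm_loss


def classifier_free_guidance(eps_cond, eps_uncond, scale):
    # Extrapolate from the unconditional noise prediction toward the
    # conditional one; scale = 1 returns the conditional prediction.
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    dim = 2
    zeros = {
        "w_gate": [[0.0] * (2 * dim) for _ in range(dim)],
        "b_gate": [0.0] * dim,
        "w_cand": [[0.0] * (2 * dim) for _ in range(dim)],
        "b_cand": [0.0] * dim,
    }
    fused = gated_fusion([2.0, 4.0], [1.0, -1.0], zeros)
    guided = classifier_free_guidance([1.0, 2.0], [0.0, 1.0], scale=2.0)
    print(fused, guided, multitask_loss(1.0, 0.5, icm_weight=2.0))
```

In a multi-turn setting, the fused feature from one turn would condition the denoiser that produces that turn's target image, which then serves as the reference image for the next turn.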