Recent advances in conditional generative image models have enabled impressive results. On the one hand, text-based conditional models have achieved remarkable generation quality by leveraging large-scale datasets of image-text pairs. To enable fine-grained controllability, however, text-based models require long prompts, whose details may be ignored by the model. On the other hand, layout-based conditional models have also advanced significantly. These models rely on bounding boxes or segmentation maps for precise spatial conditioning, in combination with coarse semantic labels. The semantic labels, however, cannot express detailed appearance characteristics. In this paper, we approach fine-grained scene controllability through image collages, which allow a rich visual description of the desired scene as well as the appearance and location of the objects therein, without the need for class or attribute labels. We introduce "mixing and matching scenes" (M&Ms), an approach built on an adversarially trained generative image model that is conditioned on the appearance features and spatial positions of the different elements in a collage and integrates them into a coherent image. We train our model on the OpenImages (OI) dataset and evaluate it on collages derived from the OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity. On the MS-COCO dataset, we highlight the generalization ability of our model by outperforming DALL-E in terms of zero-shot FID, despite using two orders of magnitude fewer parameters and less data. Collage-based generative models have the potential to advance content creation efficiently and effectively, as they are intuitive to use and yield high-quality generations.