Controllable Image Generation via Collage Representations

Recent advances in conditional generative image models have enabled impressive results. On the one hand, text-based conditional models have achieved remarkable generation quality by leveraging large-scale datasets of image-text pairs. To enable fine-grained controllability, however, text-based models require long prompts, whose details may be ignored by the model. On the other hand, layout-based conditional models have also witnessed significant advances. These models rely on bounding boxes or segmentation maps for precise spatial conditioning, in combination with coarse semantic labels. The semantic labels, however, cannot be used to express detailed appearance characteristics. In this paper, we approach fine-grained scene controllability through image collages, which allow a rich visual description of the desired scene as well as the appearance and location of the objects therein, without the need for class or attribute labels. We introduce "mixing and matching scenes" (M&Ms), an approach that consists of an adversarially trained generative image model which is conditioned on the appearance features and spatial positions of the different elements in a collage, and integrates these into a coherent image. We train our model on the OpenImages (OI) dataset and evaluate it on collages derived from the OI and MS-COCO datasets. Our experiments on the OI dataset show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity. On the MS-COCO dataset, we highlight the generalization ability of our model by outperforming DALL-E in terms of the zero-shot FID metric, despite using two orders of magnitude fewer parameters and less data. Collage-based generative models have the potential to advance content creation in an efficient and effective way, as they are intuitive to use and yield high-quality generations.
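To make the idea of conditioning on "appearance features and spatial positions" more concrete, the following is a minimal sketch of how a collage could be represented and turned into a spatial conditioning map. It is not the encoding used by M&Ms, which the abstract does not specify. The names `CollageElement` and `rasterize_collage`, the normalized box format, and the rule that later elements overwrite earlier ones are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class CollageElement:
    """One collage piece: where it sits and what it looks like (hypothetical type)."""
    # (x0, y0, x1, y1) in normalized [0, 1] image coordinates (assumed format).
    box: Tuple[float, float, float, float]
    # Appearance embedding, e.g. produced by some pretrained image encoder (assumed).
    feature: Sequence[float]


def rasterize_collage(
    elements: Sequence[CollageElement], grid_size: int
) -> List[List[Optional[List[float]]]]:
    """Paint each element's feature vector onto a grid_size x grid_size map.

    Later elements overwrite earlier ones, so list order acts as depth order.
    Cells that no element covers stay None.
    """
    grid: List[List[Optional[List[float]]]] = [
        [None] * grid_size for _ in range(grid_size)
    ]
    for element in elements:
        x0, y0, x1, y1 = element.box
        for row in range(grid_size):
            for col in range(grid_size):
                # Use the cell centre to decide whether the box covers the cell.
                cx = (col + 0.5) / grid_size
                cy = (row + 0.5) / grid_size
                if x0 <= cx < x1 and y0 <= cy < y1:
                    grid[row][col] = list(element.feature)
    return grid


# Usage: a background covering the whole image and a foreground object
# placed in the top-left quadrant.
background = CollageElement(box=(0.0, 0.0, 1.0, 1.0), feature=[0.0, 1.0])
foreground = CollageElement(box=(0.0, 0.0, 0.5, 0.5), feature=[1.0, 0.0])
conditioning = rasterize_collage([background, foreground], grid_size=4)
```

In this sketch no class or attribute label appears anywhere: each element is described only by its box and its feature vector, which mirrors the label-free conditioning the abstract describes. A real model would consume such a map (or the per-element pairs directly) as input to the generator.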

