Despite the tremendous success of text-to-image generative models, localized text-to-image generation (that is, generating objects or features at specific locations in an image while keeping the overall generation consistent) still requires either explicit training or substantial additional inference time. In this work, we show that localized generation can be achieved simply by controlling cross attention maps during inference. Without additional training, changes to the model architecture, or extra inference time, our proposed cross attention control (CAC) gives standard text-to-image models new open-vocabulary localization abilities. When applied at inference time, CAC also enhances models that are already trained for localized generation. Furthermore, to assess localized text-to-image generation performance automatically, we develop a standardized suite of evaluations using large pretrained recognition models. Our experiments show that CAC improves localized generation performance across various types of location information, ranging from bounding boxes to semantic segmentation maps, and enhances the compositional capability of state-of-the-art text-to-image generative models.
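To make the idea of controlling cross attention maps at inference concrete, the following is a minimal NumPy sketch, not the formulation used in this work. It assumes a single attention head, a flattened 4x4 grid of image locations, and a simple rule in which each text token's attention is zeroed outside its permitted region after the softmax. The names `box_to_mask`, `masked_cross_attention`, and `region_mask`, as well as the choice to mask after the softmax without renormalizing, are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def box_to_mask(height, width, box):
    """Flattened boolean mask for a box (top, left, bottom, right).

    Coordinates are in grid cells and end-exclusive (hypothetical convention).
    """
    top, left, bottom, right = box
    mask = np.zeros((height, width), dtype=bool)
    mask[top:bottom, left:right] = True
    return mask.reshape(-1)


def masked_cross_attention(queries, keys, values, region_mask):
    """Cross attention with per-token spatial masks.

    queries:     (P, d) one vector per image location
    keys/values: (T, d) one vector per text token
    region_mask: (P, T) True where location p may attend to token t
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)   # standard attention map, shape (P, T)
    attn = attn * region_mask         # assumption: zero out attention outside each region
    return attn @ values, attn


# Toy setup: token 0 is global, token 1 is confined to the left half,
# token 2 to the right half of a 4x4 grid.
rng = np.random.default_rng(0)
H = W = 4
d = 8
queries = rng.normal(size=(H * W, d))
keys = rng.normal(size=(3, d))
values = rng.normal(size=(3, d))
region_mask = np.stack(
    [
        np.ones(H * W, dtype=bool),
        box_to_mask(H, W, (0, 0, 4, 2)),
        box_to_mask(H, W, (0, 2, 4, 4)),
    ],
    axis=1,
)
out, attn = masked_cross_attention(queries, keys, values, region_mask)
```

Because the mask only rescales an attention map that the model computes anyway, a rule of this kind adds no trainable parameters and no extra forward passes, which is in keeping with the training-free setting described above.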