Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable Diffusion can generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects, such as characters, in specified positional relationships. The missing capability to ``direct'' the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of the objects denoted by those words, we introduce an optimization objective that produces ``activation'' at desired positions in these cross-attention maps. The resulting approach is a step toward generalizing the applicability of text-guided diffusion models beyond single images to collections of related images, as in storybooks. Directed Diffusion provides easy high-level positional control over multiple objects, while making use of an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines of code to implement.
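To make the idea of an objective that produces ``activation'' at desired positions concrete, the following is a minimal toy sketch, not the method's actual objective or implementation. It assumes a single prompt token whose pre-softmax cross-attention scores over a 16 x 16 grid are available as a plain array, and it optimizes those scores directly by gradient descent. In a real diffusion model the maps depend on the latent and the network weights, which this sketch ignores. The loss (one minus the share of attention mass inside a target box) and the names `box_mask`, `placement_loss_and_grad`, and `direct_attention` are all hypothetical choices made for illustration.

```python
import numpy as np


def box_mask(height, width, box):
    """Binary mask that is 1 inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    mask = np.zeros((height, width))
    mask[top:bottom, left:right] = 1.0
    return mask


def placement_loss_and_grad(logits, mask):
    """Hypothetical objective: 1 - (share of attention mass inside the mask).

    `logits` stands in for one prompt token's pre-softmax cross-attention
    scores over an H x W grid of image positions (an assumption of this sketch).
    """
    flat = logits.ravel()
    attn = np.exp(flat - flat.max())  # numerically stable softmax over positions
    attn /= attn.sum()
    m = mask.ravel()
    inside = float(attn @ m)
    # Analytic gradient of the loss with respect to each logit:
    # d(inside)/d(logit_j) = attn_j * (m_j - inside), and loss = 1 - inside.
    grad = -(attn * (m - inside)).reshape(logits.shape)
    return 1.0 - inside, grad


def direct_attention(logits, mask, steps=2000, step_size=0.3):
    """Plain gradient descent on the logits; returns an optimized copy."""
    logits = logits.copy()
    for _ in range(steps):
        _, grad = placement_loss_and_grad(logits, mask)
        logits -= step_size * grad
    return logits


rng = np.random.default_rng(0)
start = rng.normal(size=(16, 16))        # random stand-in attention scores
target = box_mask(16, 16, (2, 2, 8, 8))  # desired region for the object
loss_before, _ = placement_loss_and_grad(start, target)
loss_after, _ = placement_loss_and_grad(direct_attention(start, target), target)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The small step size is a deliberately conservative choice so that each update lowers the loss. A practical system would more likely backpropagate such a loss through the model to the latent rather than edit attention scores directly.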