Text-guided diffusion models such as DALL-E 2, Imagen, and Stable Diffusion can generate an effectively endless variety of images from only a short text prompt describing the desired content, and in many cases the images are of very high quality. However, these models often struggle to compose scenes containing several key objects, such as characters, in specified positional relationships. The ability to "direct" the placement of characters and objects both within and across images is crucial in storytelling, as recognized in the literature on film and animation theory, and it is missing from these models. In this work, we take a particularly straightforward approach to providing the needed direction. Drawing on the observation that the cross-attention maps for prompt words reflect the spatial layout of the objects those words denote, we introduce an optimization objective that produces "activation" at desired positions in these cross-attention maps. The resulting approach is a step toward extending text-guided diffusion models beyond single images to collections of related images, as in storybooks. To the best of our knowledge, our Directed Diffusion method is the first diffusion technique that provides positional control over multiple objects while using an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines of code to implement.
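The idea of an objective that produces activation at a desired position in a cross-attention map can be illustrated with a toy sketch. This is not the authors' implementation: the softmax-parameterized map, the particular loss (one minus the share of attention mass inside a target box), and all function names, the step count, and the learning rate are assumptions chosen for illustration. In a real pipeline the map would come from the cross-attention layers of the pre-trained model rather than from random logits.

```python
import math
import random

def softmax(logits):
    """Normalize a flat list of logits into weights that sum to 1."""
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]

def box_mask(height, width, box):
    """Flat 0/1 mask; box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return [1.0 if top <= r < bottom and left <= c < right else 0.0
            for r in range(height) for c in range(width)]

def region_loss(attn, mask):
    """One minus the share of attention mass that falls inside the box."""
    return 1.0 - sum(a * m for a, m in zip(attn, mask)) / sum(attn)

def direct_attention(logits, mask, steps=200, lr=50.0):
    """Gradient descent on region_loss with respect to the logits (hypothetical helper)."""
    z = list(logits)
    for _ in range(steps):
        attn = softmax(z)
        inside = sum(a * m for a, m in zip(attn, mask))
        # For a softmax map, d(loss)/d(z_i) = -a_i * (m_i - inside),
        # so a descent step adds lr * a_i * (m_i - inside) to each logit.
        z = [v + lr * a * (m - inside) for v, a, m in zip(z, attn, mask)]
    return softmax(z)

# Toy stand-in for the cross-attention map of a single prompt word.
random.seed(0)
H, W = 16, 16
logits = [random.gauss(0.0, 1.0) for _ in range(H * W)]
mask = box_mask(H, W, (2, 2, 8, 8))  # assumed target region for the object

before = region_loss(softmax(logits), mask)
after = region_loss(direct_attention(logits, mask), mask)
print(f"loss before: {before:.3f}, after: {after:.3f}")
```

Each step raises the logits inside the box and lowers those outside it, so the attention mass for the word drifts toward the requested region. Positioning several objects would, under the same assumptions, amount to one such term per prompt word and target box.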