Text-guided diffusion models such as DALL-E 2, Imagen, and Stable Diffusion can generate an effectively endless variety of images given only a short text prompt describing the desired image content. In many cases the images are also of very high quality. However, these models often struggle to compose scenes containing several key objects, such as characters, in specified positional relationships. Yet the ability to ``direct'' the placement of characters and objects, both within and across images, is crucial in storytelling, as recognized in the literature on film and animation theory. In this work we take a particularly straightforward approach to providing the needed direction: we inject ``activation'' at desired positions in the cross-attention maps corresponding to the objects under control, while attenuating the remainder of each map. The resulting approach is a step toward extending text-guided diffusion models beyond single images to collections of related images, as in storybooks. To the best of our knowledge, our Directed Diffusion method is the first diffusion technique that provides positional control over multiple objects while using an existing pre-trained model and maintaining a coherent blend between the positioned objects and the background. Moreover, it requires only a few lines of code to implement.
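To make the core idea concrete, the following is a minimal sketch of editing a single cross-attention map: activation is added inside a target box and the rest of the map is attenuated. It is an illustration under stated assumptions, not the implementation used in this work. The function name `direct_attention`, the parameters `boost` and `damp`, the uniform (rather than shaped) injection, and the toy 4x4 map are all hypothetical choices made for the example.

```python
def direct_attention(attn, box, boost=1.0, damp=0.1):
    """Return an edited copy of one token's cross-attention map.

    attn  : 2D list (rows x cols) of non-negative attention weights.
    box   : (top, left, bottom, right), half-open, in map coordinates.
    boost : value added inside the box (assumed uniform injection).
    damp  : factor applied outside the box (assumed attenuation).
    """
    top, left, bottom, right = box
    edited = []
    for r, row in enumerate(attn):
        new_row = []
        for c, value in enumerate(row):
            inside = top <= r < bottom and left <= c < right
            # Strengthen attention where the object should appear,
            # weaken it everywhere else.
            new_row.append(value + boost if inside else value * damp)
        edited.append(new_row)
    return edited


# Toy 4x4 map with uniform attention for one object token.
attn_map = [[0.5] * 4 for _ in range(4)]

# Direct the object toward the top-left 2x2 quadrant.
directed = direct_attention(attn_map, box=(0, 0, 2, 2))
```

In a real pipeline an edit of this kind would be applied to the cross-attention maps of the prompt tokens that name the controlled object, inside the pre-trained model's attention layers. Which denoising steps and layers to edit, and how strongly, are design choices that this sketch does not address.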