We seek to give users precise control over diffusion-based image generation by modeling complex scenes as sequences of layers, which define the desired spatial arrangement and visual attributes of objects in the scene. Collage Diffusion harmonizes the input layers so that objects fit together. The key challenge is to minimize changes to the positions and key visual attributes of the input layers while allowing other attributes to change during harmonization. We ensure that objects are generated in the correct locations by modifying text-image cross-attention with the layers' alpha masks. We preserve key visual attributes of input layers by learning specialized text representations per layer and by extending ControlNet to operate on layers. Layer input allows users to control the extent of image harmonization on a per-object basis, and users can even iteratively edit individual objects in generated images while keeping other objects fixed. By leveraging the rich information present in layer input, Collage Diffusion generates globally harmonized images that maintain desired object characteristics better than prior approaches.
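To make the idea of modifying text-image cross-attention with alpha masks concrete, the following is a minimal NumPy sketch and not the exact formulation used by Collage Diffusion. It assumes a simple additive bias: for each text token tied to a layer, the pre-softmax attention score is raised at pixels inside that layer's alpha mask and lowered outside it. The function name `masked_cross_attention`, the `token_layer` mapping, and the `strength` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def masked_cross_attention(logits, layer_masks, token_layer, strength=5.0):
    """Bias cross-attention so layer-specific tokens attend inside their mask.

    logits:      (P, T) pre-softmax scores, P image positions x T text tokens.
    layer_masks: (L, P) alpha values in [0, 1], one flattened mask per layer.
    token_layer: length-T sequence; layer index for each token, or -1 if the
                 token is not tied to any layer (illustrative assumption).
    strength:    size of the additive bias (hypothetical hyperparameter).
    """
    biased = np.array(logits, dtype=float)  # copy so the input is untouched
    layer_masks = np.asarray(layer_masks, dtype=float)

    for t, layer in enumerate(token_layer):
        if layer < 0:
            continue  # global tokens keep their original scores
        alpha = layer_masks[layer]  # shape (P,)
        # Map alpha in [0, 1] to a bias in [-strength, +strength].
        biased[:, t] += strength * (2.0 * alpha - 1.0)

    # Numerically stable softmax over the token axis.
    biased -= biased.max(axis=1, keepdims=True)
    weights = np.exp(biased)
    return weights / weights.sum(axis=1, keepdims=True)

# Toy example: 4 pixels, 3 tokens, 2 layers.
toy_logits = np.zeros((4, 3))
toy_masks = np.array([[1.0, 1.0, 0.0, 0.0],   # layer 0 covers pixels 0-1
                      [0.0, 0.0, 1.0, 1.0]])  # layer 1 covers pixels 2-3
toy_token_layer = [0, 1, -1]                  # token 2 is a global token
toy_attn = masked_cross_attention(toy_logits, toy_masks, toy_token_layer)
```

In this toy setup, pixels inside layer 0 place most of their attention on token 0, and pixels inside layer 1 on token 1, which is the qualitative behavior the abstract describes. A soft additive bias is used here rather than hard masking so that partially transparent alpha values still have a graded effect.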