Text-to-image models give rise to workflows that often begin with an exploration step, in which users sift through a large collection of generated images. The global nature of the text-to-image generation process prevents users from narrowing their exploration to a particular object in the image. In this paper, we present a technique to generate a collection of images that depict variations in the shape of a specific object, enabling an object-level shape exploration process. Creating plausible variations is challenging, as it requires control over the shape of the generated object while respecting its semantics. A particular challenge when generating object variations is accurately localizing the manipulation applied to the object's shape. We introduce a prompt-mixing technique that switches between prompts along the denoising process to attain a variety of shape choices. To localize the image-space operation, we present two techniques that use the self-attention layers in conjunction with the cross-attention layers. Moreover, we show that these localization techniques are general and effective beyond the scope of generating object variations. Extensive results and comparisons demonstrate the effectiveness of our method in generating object variations and the quality of our localization techniques.
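The idea of switching between prompts along the denoising process can be illustrated with a minimal sketch. It only builds a per-step prompt schedule and does not call any diffusion model. The function name `prompt_schedule`, the single switching window `[start_step, end_step)`, and the example prompts are assumptions made for illustration; the abstract does not specify how many intervals are used or where they fall.

```python
from typing import List


def prompt_schedule(
    base_prompt: str,
    variant_prompt: str,
    num_steps: int,
    start_step: int,
    end_step: int,
) -> List[str]:
    """Return the prompt to condition on at each denoising step.

    Hypothetical helper: steps in [start_step, end_step) use the variant
    prompt, all other steps use the base prompt. The choice of a single
    window is an assumption, not a detail taken from the abstract.
    """
    if not 0 <= start_step <= end_step <= num_steps:
        raise ValueError("expected 0 <= start_step <= end_step <= num_steps")
    return [
        variant_prompt if start_step <= step < end_step else base_prompt
        for step in range(num_steps)
    ]


# Example: 10 denoising steps, with the variant prompt active on steps 2..5.
schedule = prompt_schedule(
    base_prompt="a photo of a basket with apples",
    variant_prompt="a photo of a bowl with apples",
    num_steps=10,
    start_step=2,
    end_step=6,
)
```

In a real pipeline, each entry of `schedule` would be encoded and passed as the text conditioning for the corresponding denoising step; varying `variant_prompt` and the window would then yield different candidates for the same object.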