Text-to-image models give rise to workflows that often begin with an exploration step, in which users sift through a large collection of generated images. Because text-to-image generation is a global process, users cannot narrow their exploration to a particular object in the image. In this paper, we present a technique to generate a collection of images that depict variations in the shape of a specific object, enabling an object-level shape exploration process. Creating plausible variations is challenging, as it requires control over the shape of the generated object while respecting its semantics. A particular challenge when generating object variations is accurately localizing the manipulation applied to the object's shape. We introduce a prompt-mixing technique that switches between prompts along the denoising process to attain a variety of shape choices. To localize the image-space operation, we present two techniques that use the self-attention layers in conjunction with the cross-attention layers. Moreover, we show that these localization techniques are general and effective beyond the scope of generating object variations. Extensive results and comparisons demonstrate the effectiveness of our method in generating object variations and the competence of our localization techniques.
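To make the idea of switching between prompts along the denoising process concrete, the following is a minimal sketch of a per-step prompt schedule. It is not the implementation from the paper: the abstract does not say how the steps are divided or how the alternative prompts are formed, so the single switching window (`switch_start`, `switch_end`), the word-replacement scheme, and the names `prompt_schedule`, `variation_schedules`, and `proxy_word` are all assumptions made for illustration. No diffusion model is called; the sketch only decides which prompt would condition each step.

```python
from typing import Dict, List


def prompt_schedule(
    prompt: str,
    object_word: str,
    proxy_word: str,
    num_steps: int,
    switch_start: int,
    switch_end: int,
) -> List[str]:
    """Return the prompt that conditions each denoising step.

    Assumption (not from the abstract): steps in [switch_start, switch_end)
    use the prompt with `object_word` replaced by `proxy_word`; every other
    step uses the original prompt.
    """
    if not 0 <= switch_start <= switch_end <= num_steps:
        raise ValueError("need 0 <= switch_start <= switch_end <= num_steps")
    words = prompt.split()
    if object_word not in words:
        raise ValueError(f"{object_word!r} is not a word in the prompt")
    # Build the alternative prompt by swapping the object word.
    mixed = " ".join(proxy_word if w == object_word else w for w in words)
    return [
        mixed if switch_start <= step < switch_end else prompt
        for step in range(num_steps)
    ]


def variation_schedules(
    prompt: str,
    object_word: str,
    proxy_words: List[str],
    num_steps: int,
    switch_start: int,
    switch_end: int,
) -> Dict[str, List[str]]:
    """One schedule per hypothetical proxy word, i.e. one per variation."""
    return {
        proxy: prompt_schedule(
            prompt, object_word, proxy, num_steps, switch_start, switch_end
        )
        for proxy in proxy_words
    }


if __name__ == "__main__":
    schedules = variation_schedules(
        "a basket with oranges", "basket", ["bowl", "tray"], 6, 2, 4
    )
    for proxy, schedule in schedules.items():
        print(proxy, schedule)
```

In a real pipeline, each scheduled prompt would be encoded and passed to the denoiser at the corresponding step. The localization techniques based on self-attention and cross-attention that the abstract mentions are not covered by this sketch.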