Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling. Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images. For results and an interactive demo, see our project page at https://dave.ml/selfguidance/
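To make the guidance idea concrete, the following is a toy sketch, not the authors' implementation. It assumes a stand-in "internal representation": an attention-like map obtained as a softmax over a free logit vector, with no diffusion model involved. The property extracted from the map is its centroid (a proxy for object location), and the map is steered by descending the gradient of an energy that measures the squared distance between the centroid and a target position. All names (`energy_and_grad`, `guide`, `step`, `n_steps`) and the 8x8 grid are hypothetical choices for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a flat vector.
    e = np.exp(logits - logits.max())
    return e / e.sum()

def energy_and_grad(logits, coords, target):
    # Attention-like map standing in for an internal representation (assumption).
    attn = softmax(logits)
    # Property read off the map: its centroid, a proxy for object location.
    c = attn @ coords
    diff = c - target
    energy = float(diff @ diff)
    # Analytic gradient: d c_k / d logit_i = attn_i * (coords_ik - c_k).
    grad = 2.0 * attn * ((coords - c) @ diff)
    return energy, grad

def guide(logits, coords, target, step=20.0, n_steps=300):
    # Repeatedly nudge the logits down the energy gradient,
    # mimicking how a guidance term steers sampling.
    logits = logits.copy()
    for _ in range(n_steps):
        _, grad = energy_and_grad(logits, coords, target)
        logits -= step * grad
    return logits

# Toy 8x8 grid with (x, y) coordinates in [0, 1].
H = W = 8
ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
coords = np.stack([xs.ravel(), ys.ravel()], axis=1)

rng = np.random.default_rng(0)
logits0 = rng.normal(size=H * W)
target = np.array([0.8, 0.2])  # hypothetical desired object position

e0, _ = energy_and_grad(logits0, coords, target)
logits1 = guide(logits0, coords, target)
e1, _ = energy_and_grad(logits1, coords, target)
```

In this sketch the energy after guidance (`e1`) is much smaller than before (`e0`), meaning the centroid of the map has moved toward the target. In an actual diffusion sampler the gradient would presumably be taken with respect to the noisy latent through the network and combined with the model's noise prediction, which this toy omits.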