Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling. Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images. For results and an interactive demo, see our project page at https://dave.ml/selfguidance/
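As a rough illustration of the guidance idea described in the abstract, the toy sketch below defines a property (the centroid of an attention-like map), turns the distance to a target value into an energy, and subtracts the scaled gradient of that energy at each update, in the style of classifier guidance. Everything here is an assumption made for illustration: the names `softmax_map`, `centroid`, `energy`, `guided_update`, and `run_demo` are hypothetical, the 5x5 grid of logits stands in for a real diffusion model's internal representations, and the finite-difference gradient replaces the backpropagation a real implementation would use. It is not the authors' implementation.

```python
import math


def softmax_map(logits):
    """Normalize a 2D grid of logits into a map summing to 1 (toy attention map)."""
    peak = max(max(row) for row in logits)
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def centroid(attn):
    """Attention-weighted mean (row, col) position: a toy 'location' property."""
    cy = sum(a * i for i, row in enumerate(attn) for a in row)
    cx = sum(a * j for row in attn for j, a in enumerate(row))
    return cy, cx


def energy(logits, target):
    """Squared distance between the current centroid and the target centroid."""
    cy, cx = centroid(softmax_map(logits))
    return (cy - target[0]) ** 2 + (cx - target[1]) ** 2


def energy_grad(logits, target, h=1e-5):
    """Central finite-difference gradient of the energy (stand-in for autodiff)."""
    work = [row[:] for row in logits]
    grad = [[0.0] * len(row) for row in logits]
    for i, row in enumerate(work):
        for j in range(len(row)):
            original = work[i][j]
            work[i][j] = original + h
            up = energy(work, target)
            work[i][j] = original - h
            down = energy(work, target)
            work[i][j] = original
            grad[i][j] = (up - down) / (2 * h)
    return grad


def guided_update(logits, target, scale, base_update=None):
    """Apply the model's own update (assumed zero here) minus the scaled energy gradient."""
    grad = energy_grad(logits, target)
    if base_update is None:
        base_update = [[0.0] * len(row) for row in logits]
    return [
        [v + u - scale * g for v, u, g in zip(row, urow, grow)]
        for row, urow, grow in zip(logits, base_update, grad)
    ]


def run_demo(steps=200, scale=0.2):
    """Move the centroid of a toy map from near the top-left toward (3, 3)."""
    logits = [[0.0] * 5 for _ in range(5)]
    logits[0][0] = 2.0  # initial mass concentrated in the top-left corner
    target = (3.0, 3.0)
    initial = energy(logits, target)
    for _ in range(steps):
        logits = guided_update(logits, target, scale)
    return initial, energy(logits, target)
```

Calling `run_demo()` returns the energy before and after guidance, so the two values can be compared to check that the centroid has moved toward the target. In an actual diffusion sampler the gradient would be taken with respect to the noisy latent and combined with the model's noise prediction; the `base_update` argument is only a placeholder for that term.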