Large-scale generative models are capable of producing high-quality images from detailed text descriptions. However, many aspects of an image are difficult or impossible to convey through text. We introduce self-guidance, a method that provides greater control over generated images by guiding the internal representations of diffusion models. We demonstrate that properties such as the shape, location, and appearance of objects can be extracted from these representations and used to steer sampling. Self-guidance works similarly to classifier guidance, but uses signals present in the pretrained model itself, requiring no additional models or training. We show how a simple set of properties can be composed to perform challenging image manipulations, such as modifying the position or size of objects, merging the appearance of objects in one image with the layout of another, composing objects from many images into one, and more. We also show that self-guidance can be used to edit real images. For results and an interactive demo, see our project page at https://dave.ml/selfguidance/
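As a rough illustration of what extracting a property from an internal representation could look like, the sketch below treats a 2D attention map for one object as a soft mask and computes its centroid, then scores how far that centroid is from a desired location. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `centroid` and `position_energy`, the normalized [0, 1] coordinate convention, and the squared-distance penalty are all hypothetical choices made for this example. In a guidance setting, the gradient of such a score with respect to the noisy sample would be used to steer sampling, much as a classifier's gradient is used in classifier guidance; that step needs automatic differentiation through the diffusion model and is omitted here.

```python
def centroid(attn):
    """Attention-weighted mean (row, col) position in [0, 1] coordinates.

    `attn` is a hypothetical 2D map (list of rows) of non-negative weights
    for one object; it must have at least 2 rows, 2 columns, and a
    positive total mass.
    """
    h, w = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    cy = sum(v * (i / (h - 1)) for i, row in enumerate(attn) for v in row) / total
    cx = sum(v * (j / (w - 1)) for row in attn for j, v in enumerate(row)) / total
    return cy, cx


def position_energy(attn, target):
    """Squared distance between the object's centroid and a target (row, col).

    Assumed penalty for illustration: lower values mean the object sits
    closer to the requested location.
    """
    cy, cx = centroid(attn)
    ty, tx = target
    return (cy - ty) ** 2 + (cx - tx) ** 2


# Toy map with all attention mass in the top-right corner.
attn = [[0.0] * 5 for _ in range(5)]
attn[0][4] = 1.0

print(centroid(attn))                      # centroid of the toy object
print(position_energy(attn, (0.5, 0.5)))   # penalty for not being centered
```

Analogous scalar properties, such as an object's size taken as the mean of its map, could be defined in the same way and summed into one score, which is one way to read the abstract's remark that a simple set of properties can be composed.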