Given a small number of images of a subject, personalized image generation techniques can fine-tune large pre-trained text-to-image diffusion models to generate images of the subject in novel contexts, conditioned on text prompts. In doing so, a trade-off arises among prompt fidelity, subject fidelity, and diversity. As the pre-trained model is fine-tuned, earlier checkpoints synthesize images with low subject fidelity but high prompt fidelity and diversity. In contrast, later checkpoints generate images with high subject fidelity but low prompt fidelity and diversity. This inherent trade-off limits the prompt fidelity, subject fidelity, and diversity of generated images. In this work, we propose DreamBlend, which combines the prompt fidelity of earlier checkpoints with the subject fidelity of later checkpoints at inference time. We perform cross-attention-guided image synthesis with a later checkpoint, guided by an image generated by an earlier checkpoint for the same prompt. This enables the generation of images with better subject fidelity, prompt fidelity, and diversity on challenging prompts, outperforming state-of-the-art fine-tuning methods.
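To make the cross-attention guidance concrete, the following is a minimal toy sketch (not the paper's implementation) of the core idea: compute the cross-attention map from an early checkpoint's pass and inject it into a later checkpoint's synthesis, so the early map steers *where* text tokens attend while the later checkpoint supplies the subject features. All array shapes and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Standard cross-attention: image queries attend over text keys/values.
    Returns the attended output and the (n_img, n_txt) attention map."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v, attn

# Toy features for the same prompt from two fine-tuning checkpoints:
# an early one (high prompt fidelity) and a later one (high subject fidelity).
rng = np.random.default_rng(0)
n_img, n_txt, d = 16, 4, 8
k = rng.normal(size=(n_txt, d))          # text keys (shared prompt)
v_late = rng.normal(size=(n_txt, d))     # later checkpoint's values (subject)
q_early = rng.normal(size=(n_img, d))    # early checkpoint's image queries
q_late = rng.normal(size=(n_img, d))     # later checkpoint's image queries

_, attn_early = cross_attention(q_early, k, v_late)
out_late, attn_late = cross_attention(q_late, k, v_late)

# Guided synthesis step: reuse the early checkpoint's attention map
# (prompt layout) with the later checkpoint's values (subject appearance).
out_guided = attn_early @ v_late
```

In a full diffusion pipeline this swap would happen inside the U-Net's cross-attention layers at each denoising step, with the early-checkpoint generation providing the guiding attention maps.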