Language-guided image generation has achieved great success with diffusion models. However, text is often not detailed enough to describe highly specific subjects, such as a particular dog or a certain car, so pure text-to-image generation is not accurate enough to satisfy user requirements. In this work, we present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion), which takes text together with images containing the specified subjects as an input sequence and generates customized images of those subjects. Specifically, both the input text and the input images are encoded into one unified multi-modal latent space, in which the images are projected to pseudo word embeddings that can be combined with the text to guide image generation. In addition, to suppress irrelevant parts of the input images, such as background or illumination, we propose a novel sampling technique for the diffusion model used by the image generator, which fuses the results guided by the multi-modal input and by the pure text input. By leveraging a large-scale pre-trained text-to-image generator and the designed image encoder, our method is able to generate high-quality images with complex semantics drawn from both the input text and the input images.
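The abstract names two mechanisms without giving formulas: projecting an image to a pseudo word embedding, and fusing multi-modal guidance with pure text guidance during sampling. The following is a minimal sketch of how these could look, not the paper's actual method. It assumes a single linear projection, a placeholder token slot in the prompt, and a classifier-free-guidance-style linear blend. All function names and the `mix` and `guidance_scale` parameters are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def project_image_to_pseudo_word(image_feature: Sequence[float],
                                 weight: Sequence[Sequence[float]]) -> Vector:
    """Map an image feature into the word-embedding space.

    Assumption: one linear layer (matrix-vector product). The abstract only
    states that the projection is learned, not its architecture.
    """
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weight]


def insert_pseudo_word(token_embeddings: Sequence[Sequence[float]],
                       placeholder_index: int,
                       pseudo_word: Sequence[float]) -> List[Vector]:
    """Return a copy of the prompt embeddings with one slot replaced
    by the pseudo word embedding (the input is left unmodified)."""
    mixed = [list(tok) for tok in token_embeddings]
    mixed[placeholder_index] = list(pseudo_word)
    return mixed


def fuse_noise_predictions(eps_uncond: Sequence[float],
                           eps_text: Sequence[float],
                           eps_multimodal: Sequence[float],
                           guidance_scale: float,
                           mix: float) -> Vector:
    """Blend text-only and multi-modal noise predictions.

    Assumption: a linear blend followed by classifier-free-style scaling.
    mix = 0 uses only the text-guided prediction; mix = 1 uses only the
    multi-modal one.
    """
    fused = []
    for u, t, m in zip(eps_uncond, eps_text, eps_multimodal):
        cond = (1.0 - mix) * t + mix * m
        fused.append(u + guidance_scale * (cond - u))
    return fused


# Usage with toy values (illustrative only).
pseudo = project_image_to_pseudo_word([3.0, 4.0], [[1.0, 0.0], [0.0, 2.0]])
prompt = insert_pseudo_word([[0.0, 0.0], [1.0, 1.0]], 1, pseudo)
eps = fuse_noise_predictions([0.0], [1.0], [3.0], guidance_scale=2.0, mix=0.5)
```

In this sketch, lowering `mix` moves each denoising step toward the text-only prediction. That is one plausible way to reduce the influence of background or illumination carried in by the reference image.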