Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances sample diversity by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, which serve as abstract, intermediate representations for guiding and facilitating the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens; these representations diversify and enrich the input conditions to the diffusion models, enabling more varied outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the image samples generated from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its ability to effectively control and direct the image generation process.
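The two-stage sampling pipeline described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `sample_latents` and `generate_image` are hypothetical stand-ins for the autoregressive prior and the latent-conditioned diffusion sampler, using seeded random walks and hashing in place of real models, purely to show how stochastic latent draws yield diverse outputs for a fixed caption.

```python
import random

def sample_latents(caption, vocab_size=32, length=8, rng=None):
    """Toy stand-in for the autoregressive latent prior.

    A real implementation would run a language model conditioned on the
    caption and decode a sequence of discrete latent tokens (e.g. text,
    bounding boxes, blobs, or visual tokens); here a seeded random walk
    emits tokens, which is enough to illustrate the interface.
    """
    rng = rng or random.Random()
    tokens, prev = [], hash(caption) % vocab_size
    for _ in range(length):
        prev = (prev + rng.randrange(vocab_size)) % vocab_size
        tokens.append(prev)
    return tokens

def generate_image(caption, latents):
    """Toy stand-in for the diffusion sampler.

    Deterministic given (caption, latents): all sample diversity comes
    from the latent draw, mirroring the conditioning scheme above.
    """
    return hash((caption, tuple(latents)))

caption = "a red bicycle leaning against a brick wall"
# Different latent draws give different outputs for the same caption,
# which is the diversity mechanism the abstract describes.
images = {generate_image(caption, sample_latents(caption, rng=random.Random(s)))
          for s in range(4)}
```

The design point this sketch captures is that the diffusion stage is conditioned on both the caption and the sampled latents, so diversity is controlled at the latent-prior stage rather than by raising the guidance weight.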