Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances sample diversity by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, which serve as abstract, intermediate representations for guiding and facilitating the image generation process. In this paper, we explore a variety of discrete latent representations, including textual descriptions, detection bounding boxes, object blobs, and visual tokens; these representations diversify and enrich the input conditions to the diffusion models, enabling more varied outputs. Our experimental results demonstrate that Kaleido effectively broadens the diversity of the image samples generated from a given textual description while maintaining high image quality. Furthermore, we show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its ability to effectively control and direct the image generation process.
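The two-stage sampling pipeline described above can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: `sample_latents` and `generate_image` are hypothetical stand-ins for the autoregressive prior and the latent-conditioned diffusion sampler, using seeded random walks and hashing in place of real models, purely to show how stochastic latent draws yield diverse outputs for a fixed caption.

```python
import random

def sample_latents(caption, vocab_size=32, length=8, rng=None):
    """Toy stand-in for the autoregressive latent prior.

    A real implementation would run a language model conditioned on the
    caption and decode a sequence of discrete latent tokens (e.g. text,
    bounding boxes, blobs, or visual tokens); here a seeded random walk
    emits tokens, which is enough to illustrate the interface.
    """
    rng = rng or random.Random()
    tokens, prev = [], hash(caption) % vocab_size
    for _ in range(length):
        prev = (prev + rng.randrange(vocab_size)) % vocab_size
        tokens.append(prev)
    return tokens

def generate_image(caption, latents):
    """Toy stand-in for the diffusion sampler.

    Deterministic given (caption, latents): all sample diversity comes
    from the latent draw, mirroring the conditioning scheme above.
    """
    return hash((caption, tuple(latents)))

caption = "a red bicycle leaning against a brick wall"
# Different latent draws give different outputs for the same caption,
# which is the diversity mechanism the abstract describes.
images = {generate_image(caption, sample_latents(caption, rng=random.Random(s)))
          for s in range(4)}
```

The design point this sketch captures is that the diffusion stage is conditioned on both the caption and the sampled latents, so diversity is controlled at the latent-prior stage rather than by raising the guidance weight.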