In this study, we aim to extend the capabilities of diffusion-based text-to-image (T2I) generation models by incorporating diverse modalities beyond textual descriptions, such as sketches, boxes, color palettes, and style embeddings, within a single model. To this end, we design a multimodal T2I diffusion model, which we call DiffBlender, that separates the conditioning channels into three types: image forms, spatial tokens, and non-spatial tokens. The architecture of DiffBlender makes it straightforward to add new input modalities, pioneering a scalable framework for conditional image generation. Notably, we achieve this without altering the parameters of the existing generative model, Stable Diffusion, updating only a subset of components. Through quantitative and qualitative comparisons with existing conditional generation methods, our study establishes new benchmarks in multimodal generation. We demonstrate that DiffBlender faithfully blends all the provided information and showcase its various applications in detailed image synthesis.
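The three-way split of conditions described in the abstract can be illustrated with a minimal sketch. It is not the DiffBlender implementation: the names `ConditionType`, `MODALITY_TYPES`, `register_modality`, and `group_conditions` are hypothetical, and the assignment of each modality to a type is an assumption made for illustration, since the abstract lists the modalities and the types but does not pair them. The sketch only shows how a registry keyed by condition type could let a new modality be added without touching the rest of the pipeline.

```python
from collections import defaultdict
from enum import Enum


class ConditionType(Enum):
    """The three condition types named in the abstract."""
    IMAGE_FORM = "image_form"
    SPATIAL_TOKEN = "spatial_token"
    NON_SPATIAL_TOKEN = "non_spatial_token"


# Hypothetical registry. Which modality belongs to which type is an
# assumption for illustration; the abstract does not state the pairing.
MODALITY_TYPES = {
    "sketch": ConditionType.IMAGE_FORM,
    "box": ConditionType.SPATIAL_TOKEN,
    "color_palette": ConditionType.NON_SPATIAL_TOKEN,
    "style_embedding": ConditionType.NON_SPATIAL_TOKEN,
}


def register_modality(name, cond_type):
    """Add a new input modality by declaring which type it belongs to."""
    if name in MODALITY_TYPES:
        raise ValueError(f"modality already registered: {name}")
    MODALITY_TYPES[name] = cond_type


def group_conditions(conditions):
    """Group the provided conditions by type, skipping absent (None) ones."""
    grouped = defaultdict(dict)
    for name, value in conditions.items():
        if value is None:
            continue  # the user did not supply this modality
        if name not in MODALITY_TYPES:
            raise KeyError(f"unknown modality: {name}")
        grouped[MODALITY_TYPES[name]][name] = value
    return dict(grouped)
```

Under these assumptions, calling `register_modality("depth_map", ConditionType.IMAGE_FORM)` is the only change needed before `group_conditions` accepts a depth map. Each group would then be passed to whatever conditioning module handles that type, which is outside the scope of this sketch.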