Recent progress in diffusion-based text-to-image generation models has significantly expanded generative capabilities by conditioning on text descriptions. However, because relying solely on text prompts remains restrictive for fine-grained customization, we aim to extend conditional generation to incorporate diverse modalities, e.g., sketch, box, and style embedding, simultaneously. We thus design a multimodal text-to-image diffusion model, dubbed DiffBlender, that achieves this goal in a single model by training only a few small hypernetworks. DiffBlender allows input modalities to be scaled conveniently without altering the parameters of an existing large-scale generative model, so that its well-established knowledge is retained. Furthermore, our study sets new standards for multimodal generation through quantitative and qualitative comparisons with existing approaches. By diversifying the channels of conditioning modalities, DiffBlender faithfully reflects the provided information or, in its absence, generates imaginative content.