Diffusion models have made significant strides in text-driven and layout-driven image generation. However, most diffusion models are limited to generating visible RGB images. In fact, human perception of the world is enriched by diverse modalities, including chromatic contrast, thermal illumination, and depth information. In this paper, we introduce DiffX, a novel diffusion model for general layout-guided cross-modal "RGB+X" generation. We first construct cross-modal image datasets with text descriptions, using the LLaVA model for image captioning supplemented by manual corrections. Notably, DiffX presents a simple yet effective cross-modal generative modeling pipeline that conducts the diffusion and denoising processes in a modality-shared latent space, facilitated by our Dual-Path Variational AutoEncoder (DP-VAE). Furthermore, we incorporate a gated cross-attention mechanism to connect the layout and text conditions, leveraging Long-CLIP to embed long captions for enhanced user guidance. Through extensive experiments, DiffX demonstrates robustness and flexibility in cross-modal generation across three RGB+X datasets (FLIR, MFNet, and COME15K) under various layout types. It also shows potential for adaptive generation of "RGB+X+Y" or even more diverse modalities. Our code and processed image captions are available at https://github.com/zeyuwang-zju/DiffX.
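The central idea, encoding an RGB image and its paired X-modality image into one shared latent so that a single diffusion process can generate both, can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the authors' DP-VAE: the single shared encoder with two modality-specific decoders, the layer shapes, and all names (`DPVAESketch`, `dec_rgb`, `dec_x`) are hypothetical.

```python
# Minimal sketch of a dual-path VAE over a modality-shared latent space.
# Hypothetical illustration, not the authors' implementation: we assume
# one shared encoder over the concatenated RGB+X input and two decoders
# that each reconstruct one modality from the same latent.
import torch
import torch.nn as nn

class DPVAESketch(nn.Module):
    def __init__(self, in_ch_rgb=3, in_ch_x=1, latent_ch=4):
        super().__init__()
        # Shared encoder: maps concatenated RGB+X to mean/logvar of the latent.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch_rgb + in_ch_x, 64, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 2 * latent_ch, 3, stride=2, padding=1),
        )
        # Dual decoding paths: one per modality, both reading the shared latent.
        def make_decoder(out_ch):
            return nn.Sequential(
                nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1),
                nn.SiLU(),
                nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1),
            )
        self.dec_rgb = make_decoder(in_ch_rgb)
        self.dec_x = make_decoder(in_ch_x)

    def forward(self, rgb, x):
        # Encode both modalities jointly, then reparameterize.
        mu, logvar = self.encoder(torch.cat([rgb, x], dim=1)).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Decode the single shared latent back into each modality.
        return self.dec_rgb(z), self.dec_x(z), mu, logvar
```

Under this reading, the diffusion and denoising steps would operate on the shared latent z, as in a standard latent diffusion pipeline, with the two decoding paths mapping the denoised latent back to the RGB and X outputs.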