Diffusion models have made significant strides in text-driven and layout-driven image generation. However, most diffusion models are limited to generating visible RGB images. In fact, human perception of the world is enriched by diverse modalities, including chromatic contrast, thermal illumination, and depth information. In this paper, we introduce DiffX, a novel diffusion model for general layout-guided cross-modal "RGB+X" generation. We first construct cross-modal image datasets with text descriptions, using the LLaVA model for image captioning supplemented by manual corrections. Notably, DiffX presents a simple yet effective cross-modal generative modeling pipeline that conducts the diffusion and denoising processes in a modality-shared latent space, facilitated by our Dual-Path Variational AutoEncoder (DP-VAE). Furthermore, we incorporate a gated cross-attention mechanism to connect the layout and text conditions, leveraging Long-CLIP to embed long captions for enhanced user guidance. Through extensive experiments, DiffX demonstrates robustness and flexibility in cross-modal generation across three RGB+X datasets (FLIR, MFNet, and COME15K) under various layout types. It also shows potential for adaptive generation of "RGB+X+Y" or even more diverse modalities. Our code and processed image captions are available at https://github.com/zeyuwang-zju/DiffX.
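The central idea, encoding an RGB image and its paired X-modality image into one shared latent so that a single diffusion process can generate both, can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the authors' DP-VAE: the single shared encoder with two modality-specific decoders, the layer shapes, and all names (`DPVAESketch`, `dec_rgb`, `dec_x`) are hypothetical.

```python
# Minimal sketch of a dual-path VAE over a modality-shared latent space.
# Hypothetical illustration, not the authors' implementation: we assume
# one shared encoder over the concatenated RGB+X input and two decoders
# that each reconstruct one modality from the same latent.
import torch
import torch.nn as nn

class DPVAESketch(nn.Module):
    def __init__(self, in_ch_rgb=3, in_ch_x=1, latent_ch=4):
        super().__init__()
        # Shared encoder: maps concatenated RGB+X to mean/logvar of the latent.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch_rgb + in_ch_x, 64, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 2 * latent_ch, 3, stride=2, padding=1),
        )
        # Dual decoding paths: one per modality, both reading the shared latent.
        def make_decoder(out_ch):
            return nn.Sequential(
                nn.ConvTranspose2d(latent_ch, 64, 4, stride=2, padding=1),
                nn.SiLU(),
                nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1),
            )
        self.dec_rgb = make_decoder(in_ch_rgb)
        self.dec_x = make_decoder(in_ch_x)

    def forward(self, rgb, x):
        # Encode both modalities jointly, then reparameterize.
        mu, logvar = self.encoder(torch.cat([rgb, x], dim=1)).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        # Decode the single shared latent back into each modality.
        return self.dec_rgb(z), self.dec_x(z), mu, logvar
```

Under this reading, the diffusion and denoising steps would operate on the shared latent z, as in a standard latent diffusion pipeline, with the two decoding paths mapping the denoised latent back to the RGB and X outputs.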