Recently, diffusion models have been used successfully to fit distributions for cross-modal data translation and multimodal data generation. However, these methods rely on extensive scaling and overlook inefficiency and interference between modalities. We develop the Partially Shared U-Net (PS-U-Net) architecture, an efficient multimodal diffusion model that passes text and image inputs through dedicated layers and skip-connections to preserve modality-specific fine-grained details. Inspired by image inpainting, we also propose an efficient multimodal sampling method that enables new conditional-generation scenarios while requiring only a simple joint distribution to be learned. Our experiments on the MS-COCO dataset show that our method generates higher-quality multimodal text and image data than existing multimodal diffusion models, at a comparable model size and with faster training, faster multimodal sampling, and more flexible generation.
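To make the inpainting analogy concrete, the following is a minimal sketch of how conditional generation can be drawn from a model trained only on the joint distribution. It is an illustration under stated assumptions, not the sampling procedure of PS-U-Net: we assume both modalities are continuous latents, a standard DDPM linear noise schedule, and a joint noise predictor with the hypothetical signature `eps_model(x_img, x_txt, t)`. The placeholder `toy_eps_model` stands in for a trained network. At every reverse step the known modality (here, text) is overwritten with a forward-noised copy of the observation, in the same way that known pixels are re-imposed in diffusion-based inpainting, and only the unknown modality (the image) is denoised.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumption, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_eps_model(x_img, x_txt, t):
    """Hypothetical stand-in for a trained joint noise predictor.

    A real model would return the predicted noise for both modalities;
    this placeholder returns zeros so that the sketch runs end to end.
    """
    return np.zeros_like(x_img), np.zeros_like(x_txt)


def sample_image_given_text(txt_known, img_shape, eps_model=toy_eps_model,
                            num_steps=50, seed=0):
    """Inpainting-style conditional sampling from a joint model."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # The unknown modality starts from pure Gaussian noise.
    x_img = rng.standard_normal(img_shape)

    for t in reversed(range(num_steps)):
        # Re-impose the known modality at the current noise level,
        # analogous to pasting known pixels back in during inpainting.
        noise = rng.standard_normal(txt_known.shape)
        x_txt = (np.sqrt(alpha_bars[t]) * txt_known
                 + np.sqrt(1.0 - alpha_bars[t]) * noise)

        # The joint model sees both modalities; only the image prediction
        # is used because the text is fixed by the observation.
        eps_img, _ = eps_model(x_img, x_txt, t)

        # Standard DDPM ancestral update for the unknown modality.
        mean = (x_img - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_img) \
            / np.sqrt(alphas[t])
        if t > 0:
            x_img = mean + np.sqrt(betas[t]) * rng.standard_normal(img_shape)
        else:
            x_img = mean

    return x_img
```

Swapping the roles of the two modalities gives text generation conditioned on an image, and fixing only part of one modality gives further conditional scenarios, all without training a separate conditional model.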