Diffusion models have been widely used for conditional cross-modal data generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics in language, such as object count and spatial relationships. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies affect vision-language alignment. We find that, compared to the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can: (i) boost text-to-image alignment while improving generation quality and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We run experiments on a text-to-image generation task using the MS-COCO dataset, comparing our intermediate fusion mechanism with the classic early fusion mechanism under two common conditioning methods on a U-shaped ViT backbone. Our intermediate fusion model achieves a higher CLIP Score and a lower FID, with 20% fewer FLOPs and 50% faster training, compared to a strong U-ViT baseline with early fusion.
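To give a rough intuition for the efficiency claim, the sketch below counts query-key pairs in self-attention when text tokens are present in every layer (early fusion) versus only in a subset of layers. It is an illustration under stated assumptions, not the paper's architecture: the depth, token counts, the choice of fused layers, and the function names are all hypothetical, and the resulting ratio is not meant to reproduce the reported 20% FLOPs reduction.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs evaluated by one full self-attention layer."""
    return num_tokens * num_tokens


def total_attention_pairs(depth: int, image_tokens: int, text_tokens: int,
                          fused_layers: set) -> int:
    """Sum query-key pairs over all layers.

    Text tokens join the sequence only in the layers listed in
    `fused_layers`; every other layer attends over image tokens alone.
    """
    total = 0
    for layer in range(depth):
        extra = text_tokens if layer in fused_layers else 0
        total += attention_pairs(image_tokens + extra)
    return total


# Hypothetical configuration, chosen only for illustration.
depth, image_tokens, text_tokens = 12, 256, 77

# Early fusion: text tokens are part of the sequence in every layer.
early = total_attention_pairs(depth, image_tokens, text_tokens,
                              set(range(depth)))

# Assumed intermediate fusion: text tokens appear only in the middle layers.
intermediate = total_attention_pairs(depth, image_tokens, text_tokens,
                                     {4, 5, 6, 7})

saving = 1 - intermediate / early
print(f"early: {early}, intermediate: {intermediate}, saving: {saving:.1%}")
```

This counts only the attention score matrix and ignores projections and MLP blocks, so it indicates the direction of the saving rather than a total FLOPs figure.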