An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models

Diffusion models are widely used for conditional cross-modal generation tasks such as text-to-image and text-to-video. However, state-of-the-art models still fail to align the generated visual concepts with high-level semantics expressed in language, such as object count and spatial relationships. We approach this problem from a multimodal data fusion perspective and investigate how different fusion strategies affect vision-language alignment. We find that, compared with the widely used early fusion of conditioning text in a pretrained image feature space, a specially designed intermediate fusion can (i) boost text-to-image alignment and improve generation quality, and (ii) improve training and inference efficiency by reducing low-rank text-to-image attention calculations. We run text-to-image generation experiments on the MS-COCO dataset, comparing our intermediate fusion mechanism with the classic early fusion mechanism under two common conditioning methods on a U-shaped ViT backbone. Compared with a strong U-ViT baseline that uses early fusion, our intermediate fusion model achieves a higher CLIP Score and a lower FID, with 20% fewer FLOPs and 50% higher training speed.
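
To make the efficiency argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It assumes that early fusion concatenates text tokens with image tokens at the input, so every transformer layer attends over both, and that intermediate fusion exposes text tokens only to a subset of middle layers. The function names, the cost model (attention projections and attention maps only), and all sizes in the usage example are hypothetical choices made for illustration.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulate count for one self-attention layer.

    Counts the Q/K/V/output projections (4 * n * d^2) and the two
    n x n matrix products (2 * n^2 * d). Softmax and the MLP are ignored.
    """
    projections = 4 * num_tokens * dim * dim
    attention_maps = 2 * num_tokens * num_tokens * dim
    return projections + attention_maps


def early_fusion_cost(n_img: int, n_txt: int, dim: int, depth: int) -> int:
    """Assumed early fusion: text tokens are present in every layer."""
    return depth * attention_flops(n_img + n_txt, dim)


def intermediate_fusion_cost(
    n_img: int, n_txt: int, dim: int, depth: int, fused_layers: int
) -> int:
    """Assumed intermediate fusion: only `fused_layers` layers see text."""
    if not 0 <= fused_layers <= depth:
        raise ValueError("fused_layers must be between 0 and depth")
    image_only = (depth - fused_layers) * attention_flops(n_img, dim)
    fused = fused_layers * attention_flops(n_img + n_txt, dim)
    return image_only + fused


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the paper.
    n_img, n_txt, dim, depth, fused_layers = 256, 77, 512, 13, 5
    early = early_fusion_cost(n_img, n_txt, dim, depth)
    inter = intermediate_fusion_cost(n_img, n_txt, dim, depth, fused_layers)
    print(f"early fusion attention cost:        {early:,}")
    print(f"intermediate fusion attention cost: {inter:,}")
    print(f"relative cost: {inter / early:.2%}")
```

Under these assumptions, the saving grows with the number of text tokens and with the number of layers that skip them. The sketch is not meant to reproduce the 20% FLOPs reduction reported above, which depends on the actual architecture.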

