Building scalable vision-language models that learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, a unified multimodal Transformer pre-trained solely with a single unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify the pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs, reconstructing masked signals, i.e., image pixels and text tokens, from the visible ones. This simple yet effective pre-training objective accelerates training by 3.5× compared to a model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training. Despite its simplicity, EVE achieves state-of-the-art performance on various vision-language downstream tasks, including visual question answering, visual reasoning, and image-text retrieval.
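To make the modality-aware MoE idea concrete, below is a minimal sketch (not the authors' code) of a Transformer feed-forward sub-layer replaced by experts that are selected per token according to its modality. The class names (`ExpertFFN`, `ModalityAwareMoE`) and the `modality_mask` argument are illustrative assumptions; the sketch assumes simple hard routing of vision tokens and text tokens to separate experts inside a shared network.

```python
# Minimal sketch of a modality-aware MoE feed-forward layer (assumption,
# not the paper's implementation): vision tokens and text tokens are routed
# to separate expert FFNs inside one shared Transformer block.
import torch
import torch.nn as nn


class ExpertFFN(nn.Module):
    """Standard Transformer feed-forward network used as a single expert."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ModalityAwareMoE(nn.Module):
    """Switches each token to a vision or text expert based on its modality."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.vision_expert = ExpertFFN(dim, hidden)
        self.text_expert = ExpertFFN(dim, hidden)

    def forward(self, x: torch.Tensor, modality_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); modality_mask: (batch, seq) bool, True = vision token.
        out = torch.empty_like(x)
        out[modality_mask] = self.vision_expert(x[modality_mask])
        out[~modality_mask] = self.text_expert(x[~modality_mask])
        return out
```

In this reading, the rest of the block (self-attention, normalization) stays shared across modalities, and only the expert FFNs specialize, which is one way the abstract's "selectively switching to different experts" can be realized.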
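The unified masked-signal-modeling objective can likewise be pictured as a single loss over both modalities. The sketch below is an assumption rather than the paper's exact formulation: it regresses raw pixel values for masked image patches and predicts vocabulary ids for masked text tokens, then sums the two terms; the function name and tensor layout are hypothetical.

```python
# Minimal sketch (assumption) of a unified masked-signal-modeling loss:
# reconstruct masked image pixels with an L1 regression term and masked
# text tokens with a cross-entropy term, given the visible signals.
import torch
import torch.nn.functional as F


def masked_signal_loss(pixel_pred: torch.Tensor, pixel_target: torch.Tensor,
                       patch_mask: torch.Tensor, token_logits: torch.Tensor,
                       token_target: torch.Tensor, token_mask: torch.Tensor) -> torch.Tensor:
    # pixel_pred / pixel_target: (batch, num_patches, patch_dim)
    # patch_mask: (batch, num_patches) bool, True where the patch was masked
    # token_logits: (batch, seq, vocab); token_target: (batch, seq) long ids
    # token_mask: (batch, seq) bool, True where the text token was masked
    image_loss = F.l1_loss(pixel_pred[patch_mask], pixel_target[patch_mask])
    text_loss = F.cross_entropy(token_logits[token_mask], token_target[token_mask])
    return image_loss + text_loss
```

Because both modalities share one reconstruction-style objective, no contrastive or matching heads are needed during pre-training, which is consistent with the reported speed-up over Image-Text Contrastive and Image-Text Matching training.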