Vision-language pre-training aims to learn alignments between vision and language from large amounts of data. Most existing methods learn only image-text alignments; others rely on pre-trained object detectors to exploit vision-language alignments at the object level. In this paper, we propose to learn multi-grained vision-language alignments with a unified pre-training framework that learns multi-grained aligning and multi-grained localization simultaneously. Based on this framework, we present X$^2$-VLM, an all-in-one model with a flexible modular architecture, in which we further unify image-text pre-training and video-text pre-training in a single model. X$^2$-VLM is able to learn unlimited visual concepts associated with diverse text descriptions. Experimental results show that X$^2$-VLM performs best at both base and large scale on image-text and video-text tasks, achieving a good trade-off between performance and model scale. Moreover, we show that the modular design of X$^2$-VLM makes it highly transferable to any language or domain. For example, by simply replacing the text encoder with XLM-R, X$^2$-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training. The code and pre-trained models are available at https://github.com/zengyan-97/X2-VLM.
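The modular design mentioned in the abstract, where the text encoder can be replaced while the rest of the model is kept, can be illustrated with a minimal sketch. This is not the X$^2$-VLM implementation: `ModularVLM`, the three `toy_*` encoders, and the dot-product `similarity` are hypothetical stand-ins chosen only to show the swap pattern, and the toy encoders compute simple statistics rather than learned features.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence

Vector = List[float]


def toy_vision_encoder(pixels: Sequence[float]) -> Vector:
    # Hypothetical stand-in for a vision encoder: mean and max intensity.
    return [sum(pixels) / len(pixels), max(pixels)]


def toy_english_text_encoder(text: str) -> Vector:
    # Hypothetical stand-in for a text encoder: whitespace word count plus a bias.
    return [float(len(text.split())), 1.0]


def toy_multilingual_text_encoder(text: str) -> Vector:
    # Hypothetical replacement encoder: character count plus a bias, so it
    # also handles scripts that are not whitespace-delimited.
    return [float(len(text)), 1.0]


@dataclass(frozen=True)
class ModularVLM:
    """Toy model whose encoders are independent, swappable components."""

    vision_encoder: Callable[[Sequence[float]], Vector]
    text_encoder: Callable[[str], Vector]

    def similarity(self, pixels: Sequence[float], text: str) -> float:
        # Dot product between the two embeddings (an assumed scoring rule).
        v = self.vision_encoder(pixels)
        t = self.text_encoder(text)
        return sum(a * b for a, b in zip(v, t))


english_model = ModularVLM(toy_vision_encoder, toy_english_text_encoder)

# Swap only the text encoder; the vision encoder object is reused unchanged.
multilingual_model = replace(
    english_model, text_encoder=toy_multilingual_text_encoder
)
```

In this sketch the swap leaves the original model untouched and returns a new one that shares the vision component, which mirrors the idea of replacing the text encoder with XLM-R without retraining the other modules.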