Recent years have witnessed a strong convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which benefits from modality collaboration while addressing the problem of modality entanglement. In contrast to the predominant paradigms, which rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 sets new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video captioning tasks, respectively, with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.