mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

February 1, 2023

Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

Recent years have witnessed a broad convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which benefits from modality collaboration while addressing the problem of modality entanglement. In contrast to the predominant paradigms, which rely solely on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network that shares common universal modules for modality collaboration and disentangles modality-specific modules to deal with modality entanglement. Different modules can be flexibly selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30 downstream tasks, spanning multi-modal tasks of image-text and video-text understanding and generation, and uni-modal tasks of text-only, image-only, and video-only understanding. Notably, mPLUG-2 sets new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr on the challenging MSRVTT video QA and video caption tasks with a far smaller model size and data scale. It also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.
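
The idea of composing a network from shared and modality-specific modules can be illustrated with a small sketch. This is not the released mPLUG-2 code: every module name below (text_encoder, universal_layers, fusion_module, shared_decoder, and so on) and the compose function itself are hypothetical placeholders, and the rule for when a task-specific module is appended is an assumption made only for illustration. The sketch shows just the selection logic: each input modality contributes its own disentangled module, all modalities pass through the same shared modules, and the task type decides what is added on top.

```python
from typing import Dict, List

# Hypothetical module inventory (names are placeholders, not from the paper).
# Modality-specific modules are kept separate (disentangled) ...
MODALITY_MODULES: Dict[str, str] = {
    "text": "text_encoder",
    "image": "image_encoder",
    "video": "video_encoder",
}
# ... while universal modules are shared by every modality.
SHARED_MODULES: List[str] = ["universal_layers"]
# Assumed task-specific modules for the two task families named in the abstract.
TASK_MODULES: Dict[str, str] = {
    "understanding": "fusion_module",
    "generation": "shared_decoder",
}


def compose(modalities: List[str], task_type: str) -> List[str]:
    """Return the ordered module names selected for one task."""
    unknown = [m for m in modalities if m not in MODALITY_MODULES]
    if unknown:
        raise ValueError(f"unsupported modality: {unknown}")
    if task_type not in TASK_MODULES:
        raise ValueError(f"unsupported task type: {task_type}")

    # One disentangled module per input modality, then the shared modules.
    pipeline = [MODALITY_MODULES[m] for m in modalities]
    pipeline += SHARED_MODULES

    # Assumption: uni-modal understanding needs no extra module; multi-modal
    # or generation tasks append the module registered for the task type.
    if len(modalities) > 1 or task_type == "generation":
        pipeline.append(TASK_MODULES[task_type])
    return pipeline


if __name__ == "__main__":
    print(compose(["video", "text"], "generation"))
    print(compose(["image"], "understanding"))
```

Under these assumptions, video captioning and image classification reuse the same universal_layers entry while differing in their modality-specific and task-specific modules, which is the kind of sharing the abstract calls modality collaboration.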

