MoMo: A shared encoder Model for text, image and multi-Modal representations - 专知论文

会员服务 ·

0

模态 · 多模 · Co-training · 多模态 · 高效性 ·

2023 年 4 月 11 日

MoMo: A shared encoder Model for text, image and multi-Modal representations

翻译：MoMo：面向文本、图像与多模态表示的共享编码器模型

Rakesh Chada,Zhaoheng Zheng,Pradeep Natarajan

We propose a self-supervised shared encoder model that achieves strong results on several visual, language and multimodal benchmarks while being data, memory and run-time efficient. We make three key contributions. First, in contrast to most existing works, we use a single transformer with all the encoder layers processing both the text and the image modalities. Second, we propose a stage-wise training strategy where the model is first trained on images, then jointly with unimodal text and image datasets and finally jointly with text and text-image datasets. Third, to preserve information across both the modalities, we propose a training pipeline that learns simultaneously from gradient updates of different modalities at each training update step. The results on downstream text-only, image-only and multimodal tasks show that our model is competitive with several strong models while using fewer parameters and lesser pre-training data. For example, MoMo performs competitively with FLAVA on multimodal (+3.1), image-only (+1.1) and text-only (-0.1) tasks despite having 2/5th the number of parameters and using 1/3rd the image-text training pairs. Finally, we ablate various design choices and further show that increasing model size produces significant performance gains indicating potential for substantial improvements with larger models using our approach.

翻译：我们提出了一种自监督共享编码器模型，该模型在多个视觉、语言及多模态基准测试中取得了优异结果，同时具备数据、内存和运行时的高效性。本文做出三项关键贡献：第一，与现有大部分工作不同，我们采用单一Transformer架构，其所有编码器层均同时处理文本与图像模态；第二，提出分阶段训练策略，模型先基于图像进行训练，随后联合单模态文本与图像数据集，最终联合文本与文本-图像数据集；第三，为保持跨模态信息，我们设计了一种训练流程，能在每次训练更新步骤中同时学习不同模态的梯度更新。在下游纯文本、纯图像及多模态任务上的结果表明，我们的模型在参数更少、预训练数据更少的情况下，与多个强基线模型具有竞争力。例如，MoMo在多模态（+3.1）、纯图像（+1.1）和纯文本（-0.1）任务上的表现与FLAVA相当，但参数量仅为后者的2/5，且仅使用1/3的图像-文本训练对。最后，我们消融了多种设计选择，并进一步表明模型规模扩大可带来显著的性能提升，预示采用本方法的更大规模模型具有实质性改进潜力。

0

相关内容

大模型如何端边部署？华盛顿Google提出《逐步蒸馏》法，以更少的训练数据和更小的模型规模超越更大的语言模型

大模型如何端边部署？华盛顿Google提出《逐步蒸馏》法，以更少的训练数据和更小的模型规模超越更大的语言模型

专知会员服务

78+阅读 · 2023年5月8日

【ACL2022】一个用于远距监督关系抽取的层级对比学习框架, HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction

【ACL2022】一个用于远距监督关系抽取的层级对比学习框架, HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction

专知会员服务

15+阅读 · 2022年3月24日

【Hugging Face】使用自定义数据集微调语义分割模型，Fine-Tune a Semantic Segmentation Model with a Custom Dataset

【Hugging Face】使用自定义数据集微调语义分割模型，Fine-Tune a Semantic Segmentation Model with a Custom Dataset

专知会员服务

21+阅读 · 2022年3月18日

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

专知会员服务

13+阅读 · 2022年3月12日

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

专知会员服务

27+阅读 · 2022年3月3日

【ICCV2021】多层次对比学习的跨模态检索方法

【ICCV2021】多层次对比学习的跨模态检索方法

专知会员服务

23+阅读 · 2021年10月24日

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

专知会员服务

28+阅读 · 2020年2月12日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

专知会员服务

92+阅读 · 2019年12月22日

使用 TensorFlow Lite Searcher Library 实现设备端文本到图像搜索

使用 TensorFlow Lite Searcher Library 实现设备端文本到图像搜索

TensorFlow

0+阅读 · 2022年6月1日

Pytorch多模态框架MMF

Pytorch多模态框架MMF

专知

50+阅读 · 2020年6月20日

Multi-Task Learning的几篇综述文章

Multi-Task Learning的几篇综述文章

深度学习自然语言处理

15+阅读 · 2020年6月15日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

基于PyTorch/TorchText的自然语言处理库

基于PyTorch/TorchText的自然语言处理库

专知

28+阅读 · 2019年4月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

专知

15+阅读 · 2018年5月15日

【干货】谷歌一个模型解决所有问题《One Model to Learn Them All》论文深度解读

【干货】谷歌一个模型解决所有问题《One Model to Learn Them All》论文深度解读

专知

10+阅读 · 2018年1月14日

耦合多特征部件的数控装备可靠性建模与评估技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

Mumford-Shah型图像分割问题研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于Realized GARCH框架的波动率和相关性模型理论和应用研究

国家自然科学基金

0+阅读 · 2012年12月31日

图形用户界面效率模型和优化方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

模拟仿真的输入不确定性及其在金融风险管理中的应用

国家自然科学基金

0+阅读 · 2012年12月31日

用于交互式视频检索的教练式主动学习模型

国家自然科学基金

0+阅读 · 2012年12月31日

文本语义模型和子空间聚类研究

国家自然科学基金

1+阅读 · 2009年12月31日

视频选择性注意机理与语义特征提取

国家自然科学基金

1+阅读 · 2009年12月31日

约化群酉表示的branching law及其应用

国家自然科学基金

0+阅读 · 2009年12月31日

多文种文档图像识别的多层次马尔可夫随机场模型研究

国家自然科学基金

1+阅读 · 2008年12月31日

Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors

Arxiv

0+阅读 · 2023年5月29日

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

Arxiv

10+阅读 · 2022年7月30日

Cross-Modal Discrete Representation Learning

Arxiv

18+阅读 · 2021年6月10日

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Arxiv

17+阅读 · 2020年6月2日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

One for All: Neural Joint Modeling of Entities and Events

Arxiv

11+阅读 · 2018年12月1日

Diverse Image-to-Image Translation via Disentangled Representations

Diverse Image-to-Image Translation via Disentangled Representations

Arxiv

13+阅读 · 2018年8月2日

Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

Arxiv

11+阅读 · 2018年6月16日

End-to-End Multi-Task Learning with Attention

Arxiv

19+阅读 · 2018年3月28日

VIP会员

文章信息

相关主题

最新内容

ICML 2026 | 自回归Boltzmann生成器重塑分子采样

ICML 2026 | 自回归Boltzmann生成器重塑分子采样

专知会员服务

0+阅读 · 今天15:55

GNN跨域综述：从消息传递到图基础模型

GNN跨域综述：从消息传递到图基础模型

专知会员服务

0+阅读 · 今天15:53

无人机自主控制与人工智能：系统性综述

无人机自主控制与人工智能：系统性综述

专知会员服务

11+阅读 · 今天7:25

巡飞弹与反无人机系统——现代战场的两大支柱

巡飞弹与反无人机系统——现代战场的两大支柱

专知会员服务

3+阅读 · 今天6:54

《打造“黄金舰队”》57页报告

《打造“黄金舰队”》57页报告

专知会员服务

3+阅读 · 今天6:52

《北约数字教官网络发展路径》128页报告

《北约数字教官网络发展路径》128页报告

专知会员服务

2+阅读 · 今天6:33

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

ECCV 2026 | MIMFlow：MIM与归一化流统一图像生成

专知会员服务

7+阅读 · 6月25日

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

超越自回归边界：扩散模型、世界模型与SSM如何重塑代码智能

专知会员服务

6+阅读 · 6月25日

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

重塑决策优势：美军作战艺术与多域作战中联盟联合全域指挥控制（CJADC2）体系的融合

专知会员服务

10+阅读 · 6月25日

网状网络及其在军事领域的运用

网状网络及其在军事领域的运用

专知会员服务

8+阅读 · 6月25日

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

《意识即战场——全球安全体系中认知战的演进：乌克兰构建认知作战体系的展望》

专知会员服务

8+阅读 · 6月25日

无美国参与的欧洲战争方式（万字长文）

无美国参与的欧洲战争方式（万字长文）

专知会员服务

8+阅读 · 6月25日

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

重构“下一场战争”的制胜理论：超越兰彻斯特方程与现代系统

专知会员服务

10+阅读 · 6月25日

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

《国防工业中基于模型定义的实施：产品定义数字化转型的战略路径》90页

专知会员服务

9+阅读 · 6月25日

《国防领域敏感性分析白皮书》

《国防领域敏感性分析白皮书》

专知会员服务

9+阅读 · 6月25日

相关VIP内容

大模型如何端边部署？华盛顿Google提出《逐步蒸馏》法，以更少的训练数据和更小的模型规模超越更大的语言模型

大模型如何端边部署？华盛顿Google提出《逐步蒸馏》法，以更少的训练数据和更小的模型规模超越更大的语言模型

专知会员服务

78+阅读 · 2023年5月8日

【ACL2022】一个用于远距监督关系抽取的层级对比学习框架, HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction

【ACL2022】一个用于远距监督关系抽取的层级对比学习框架, HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction

专知会员服务

15+阅读 · 2022年3月24日

【Hugging Face】使用自定义数据集微调语义分割模型，Fine-Tune a Semantic Segmentation Model with a Custom Dataset

【Hugging Face】使用自定义数据集微调语义分割模型，Fine-Tune a Semantic Segmentation Model with a Custom Dataset

专知会员服务

21+阅读 · 2022年3月18日

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

【CVPR 2022】跨模态检索的协同双流视觉-语言前训练模型，COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

专知会员服务

13+阅读 · 2022年3月12日

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

【CVPR 2022】多模态视频字幕的端到端生成预训练，End-to-end Generative Pretraining for Multimodal Video Captioning

专知会员服务

27+阅读 · 2022年3月3日

【ICCV2021】多层次对比学习的跨模态检索方法

【ICCV2021】多层次对比学习的跨模态检索方法

专知会员服务

23+阅读 · 2021年10月24日

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

【Google ICLR2020论文】嵌入式大规模检索的预训练任务，Pre-training Tasks for Embedding-based Large-scale Retrieval

专知会员服务

28+阅读 · 2020年2月12日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

【微软研究院】IMAGEBERT: CROSS-MODAL PRE-TRAINING WITH LARGE-SCALE WEAK-SUPERVISED IMAGE-TEXT DATA

专知会员服务

43+阅读 · 2020年1月28日

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

【AAAI2020】多模态注意力语义图嵌入多标签分类（Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification）

专知会员服务

92+阅读 · 2019年12月22日

热门VIP内容

开通专知VIP会员享更多权益服务

GNN跨域综述：从消息传递到图基础模型

巡飞弹与反无人机系统——现代战场的两大支柱

ICML 2026 | 自回归Boltzmann生成器重塑分子采样

无人机自主控制与人工智能：系统性综述

相关资讯

使用 TensorFlow Lite Searcher Library 实现设备端文本到图像搜索

使用 TensorFlow Lite Searcher Library 实现设备端文本到图像搜索

TensorFlow

0+阅读 · 2022年6月1日

Pytorch多模态框架MMF

Pytorch多模态框架MMF

专知

50+阅读 · 2020年6月20日

Multi-Task Learning的几篇综述文章

Multi-Task Learning的几篇综述文章

深度学习自然语言处理

15+阅读 · 2020年6月15日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

基于PyTorch/TorchText的自然语言处理库

基于PyTorch/TorchText的自然语言处理库

专知

28+阅读 · 2019年4月22日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

pytorch-pretrained-BERT：BERT PyTorch实现，可加载Google BERT预训练模型

AINLP

35+阅读 · 2018年11月6日

vae 相关论文表示学习 1

vae 相关论文表示学习 1

CreateAMind

12+阅读 · 2018年9月6日

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

【论文推荐】最新八篇生成对抗网络相关论文—条件翻译、RGB-D动作识别、量子生成对抗网络、语义对齐、视频摘要、视觉-文本注意力

专知

15+阅读 · 2018年5月15日

【干货】谷歌一个模型解决所有问题《One Model to Learn Them All》论文深度解读

【干货】谷歌一个模型解决所有问题《One Model to Learn Them All》论文深度解读

专知

10+阅读 · 2018年1月14日

相关论文

Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors

Arxiv

0+阅读 · 2023年5月29日

A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond

Arxiv

10+阅读 · 2022年7月30日

Cross-Modal Discrete Representation Learning

Arxiv

18+阅读 · 2021年6月10日

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Arxiv

17+阅读 · 2020年6月2日

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Arxiv

12+阅读 · 2020年2月19日

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation

Arxiv

19+阅读 · 2020年2月15日

One for All: Neural Joint Modeling of Entities and Events

Arxiv

11+阅读 · 2018年12月1日

Diverse Image-to-Image Translation via Disentangled Representations

Diverse Image-to-Image Translation via Disentangled Representations

Arxiv

13+阅读 · 2018年8月2日

Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

Arxiv

11+阅读 · 2018年6月16日

End-to-End Multi-Task Learning with Attention

Arxiv

19+阅读 · 2018年3月28日

相关基金

耦合多特征部件的数控装备可靠性建模与评估技术研究

国家自然科学基金

1+阅读 · 2015年12月31日

Mumford-Shah型图像分割问题研究

国家自然科学基金

0+阅读 · 2013年12月31日

基于Realized GARCH框架的波动率和相关性模型理论和应用研究

国家自然科学基金

0+阅读 · 2012年12月31日

图形用户界面效率模型和优化方法研究

国家自然科学基金

0+阅读 · 2012年12月31日

模拟仿真的输入不确定性及其在金融风险管理中的应用

国家自然科学基金

0+阅读 · 2012年12月31日

用于交互式视频检索的教练式主动学习模型

国家自然科学基金

0+阅读 · 2012年12月31日

文本语义模型和子空间聚类研究

国家自然科学基金

1+阅读 · 2009年12月31日

视频选择性注意机理与语义特征提取

国家自然科学基金

1+阅读 · 2009年12月31日

约化群酉表示的branching law及其应用

国家自然科学基金

0+阅读 · 2009年12月31日

多文种文档图像识别的多层次马尔可夫随机场模型研究

国家自然科学基金

1+阅读 · 2008年12月31日

微信扫码咨询专知VIP会员