OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation - Zhuanzhi Papers

July 6, 2021

Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation that jointly models visual, text, and audio resources. OPT is built on an encoder-decoder framework comprising three single-modal encoders that generate token-based embeddings for each modality, a cross-modal encoder that encodes the correlations among the three modalities, and two cross-modal decoders that generate text and images, respectively. For pre-training, we design a multi-task pretext learning scheme that models multi-modal resources at three data granularities, i.e., token-, modality-, and sample-level modeling, through which OPT learns to align and translate among different modalities. Pre-training is carried out on a large number of image-text-audio triplets from Open Images. Experimental results show that OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
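As a rough illustration of what "token-, modality-, and sample-level" granularity could mean for an image-text-audio triplet, the sketch below corrupts a triplet in three different ways. The abstract does not specify the actual pretext tasks, so everything here is an assumption made for illustration: the `Triplet` type, the `[MASK]` placeholder, the 15% masking rate, the choice to swap the text for the sample-level negative, and all function names are hypothetical and are not taken from OPT.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical placeholder for a hidden token
MODALITIES = ("image", "text", "audio")


@dataclass(frozen=True)
class Triplet:
    """One image-text-audio sample, each modality already tokenized (assumed)."""
    image: Tuple[str, ...]
    text: Tuple[str, ...]
    audio: Tuple[str, ...]


def mask_tokens(t: Triplet, rng: random.Random, p: float = 0.15) -> Triplet:
    """Token level: hide individual tokens independently in every modality."""
    def corrupt(tokens: Tuple[str, ...]) -> Tuple[str, ...]:
        return tuple(MASK if rng.random() < p else tok for tok in tokens)
    return Triplet(corrupt(t.image), corrupt(t.text), corrupt(t.audio))


def mask_modality(t: Triplet, rng: random.Random) -> Tuple[Triplet, str]:
    """Modality level: hide one whole modality and return which one was hidden."""
    name = rng.choice(MODALITIES)
    hidden = tuple(MASK for _ in getattr(t, name))
    return replace(t, **{name: hidden}), name


def make_matching_pair(batch: List[Triplet], index: int,
                       rng: random.Random) -> Tuple[Triplet, int]:
    """Sample level: return (triplet, label); label 1 means aligned,
    label 0 means the text was swapped in from another sample."""
    t = batch[index]
    if len(batch) < 2 or rng.random() < 0.5:
        return t, 1
    other = rng.choice([i for i in range(len(batch)) if i != index])
    return replace(t, text=batch[other].text), 0
```

In this sketch a model would be asked to recover the hidden tokens in the first two cases and to classify the pair as aligned or mismatched in the third; which objectives OPT actually uses at each level is described in the paper itself, not in this abstract.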
