SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

We present a framework for learning cross-modal video representations by pre-training directly on raw data, to facilitate various downstream video-text tasks. Our main contributions lie in the pre-training framework and the proxy tasks. First, to address the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people pay attention to several "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to improve pre-training performance. Experiments on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase is available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.
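To make the "significant words" intuition concrete, here is a minimal Python sketch of masking that draws mask positions only from words flagged as significant, rather than uniformly from all tokens. It is an illustration under stated assumptions, not the released implementation: the function name `mask_significant`, the `significant` word set, and the `n_mask` and `seed` parameters are all hypothetical, and the abstract does not say how significant words are actually selected.

```python
import random

MASK = "[MASK]"


def mask_significant(tokens, significant, n_mask=1, seed=0):
    """Mask up to n_mask tokens, sampling only from 'significant' positions.

    `significant` is a caller-supplied set of lowercase words (an assumption
    of this sketch; the selection criterion is not given in the abstract).
    Returns the masked token list and a dict {position: original token}.
    """
    rng = random.Random(seed)
    # Candidate positions are those whose token is flagged as significant.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in significant]
    # Never request more positions than there are candidates.
    chosen = set(rng.sample(candidates, min(n_mask, len(candidates))))
    masked = [MASK if i in chosen else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(chosen)}
    return masked, targets


if __name__ == "__main__":
    caption = "a man is playing the guitar".split()
    masked, targets = mask_significant(caption, {"man", "playing", "guitar"}, n_mask=2)
    print(masked)
    print(targets)
```

In this sketch, function words such as "a", "is", and "the" are never masked, so every prediction target is a content word from the caller's set.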

