We present a framework for learning cross-modal video representations by pre-training directly on raw data, with the goal of facilitating various downstream video-text tasks. Our main contributions lie in the pre-training framework and the proxy tasks. First, motivated by the shortcomings of the two mainstream pixel-level pre-training architectures (limited applicability or low efficiency), we propose Shared Network Pre-training (SNP). By employing one shared BERT-type network to refine textual and cross-modal features simultaneously, SNP is lightweight and can support various downstream applications. Second, based on the intuition that people always pay attention to a few "significant words" when understanding a sentence, we propose the Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to improve pre-training performance. Experiments conducted on three downstream video-text tasks and six datasets demonstrate that we establish a new state of the art in pixel-level video-text pre-training and achieve a satisfactory balance between pre-training efficiency and fine-tuning performance. The codebase is available at https://github.com/alipay/Ant-Multi-Modal-Framework/tree/main/prj/snps3_vtp.
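As a rough illustration of the masking half of the S3 idea, the following Python sketch masks only words judged significant, rather than sampling uniformly over all tokens. The abstract does not specify how significant words are identified, so the stop-word filter used here is purely an assumption made for illustration. The names `STOP_WORDS`, `significant_indices`, and `mask_significant` are hypothetical and are not taken from the released codebase.

```python
import random

# ASSUMPTION: a tiny stop-word list stands in for the unspecified
# "significant word" criterion; the actual method may differ.
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "to", "and", "with", "at"}
MASK_TOKEN = "[MASK]"


def significant_indices(tokens):
    """Return positions of tokens treated as significant (here: not a stop word)."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in STOP_WORDS]


def mask_significant(tokens, num_masks=1, rng=None):
    """Mask up to `num_masks` significant tokens.

    Returns the masked token list and a dict mapping each masked
    position to the original token (the prediction target).
    """
    rng = rng or random.Random()
    candidates = significant_indices(tokens)
    # Never request more positions than there are candidates.
    chosen = set(rng.sample(candidates, min(num_masks, len(candidates))))
    masked = [MASK_TOKEN if i in chosen else tok for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(chosen)}
    return masked, labels


if __name__ == "__main__":
    caption = "a man is playing the guitar on a stage".split()
    masked, labels = mask_significant(caption, num_masks=2, rng=random.Random(0))
    print(masked)
    print(labels)
```

For the caption above, `significant_indices` returns `[1, 3, 5, 8]` (the positions of "man", "playing", "guitar", and "stage"), so function words are never chosen as masking targets. In a real pipeline, the returned labels would serve as targets for a masked language modeling loss computed by the shared network.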