Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning

Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks. Previous studies of video-language pre-training mainly focus on short-form videos (i.e., within 30 seconds) and sentences, leaving long-form video-language pre-training largely unexplored. Directly learning representations from long-form videos and language may benefit many long-form video-language understanding tasks. However, it is challenging because long-range relationships are difficult to model and processing more frames imposes a heavy computational burden. In this paper, we introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset constructed from an existing public dataset. To capture the rich temporal dynamics effectively and to better align video and language in an efficient end-to-end manner, we introduce two novel designs in our LF-VILA model. First, we propose a Multimodal Temporal Contrastive (MTC) loss that learns the temporal relation across different modalities by encouraging fine-grained alignment between long-form videos and paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA) mechanism that captures long-range dependencies while reducing the computational cost of the Transformer. We fine-tune the pre-trained LF-VILA model on seven downstream long-form video-language understanding tasks covering paragraph-to-video retrieval and long-form video question-answering, and achieve new state-of-the-art performance. Specifically, our model achieves a 16.1% relative improvement on the ActivityNet paragraph-to-video retrieval task and 2.4% on the How2QA task. We release our code, dataset, and pre-trained models at https://github.com/microsoft/XPretrain.
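To make the idea of contrastive alignment between video clips and sentences concrete, the following is a minimal sketch of a generic symmetric InfoNCE loss over temporally aligned (clip, sentence) embedding pairs. It is not the MTC loss itself, whose exact form is not given in this abstract; the function name `symmetric_info_nce`, the temperature of 0.07, and the assumption that row i of each array is a matched pair are illustrative choices only.

```python
import numpy as np


def symmetric_info_nce(clip_emb, sent_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's MTC loss).

    clip_emb, sent_emb: arrays of shape (n, d). Row i of clip_emb and row i
    of sent_emb are assumed to be a temporally aligned (positive) pair; all
    other rows in the batch act as negatives.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    v = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    t = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the (assumed) temperature.
    logits = (v @ t.T) / temperature  # shape (n, n)

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        # The positive for row i sits on the diagonal.
        return -np.mean(np.diag(log_prob))

    # Average the clip-to-sentence and sentence-to-clip directions.
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))
```

The loss is small when each clip embedding is most similar to its own sentence and grows when the pairing is scrambled, which is the behaviour a fine-grained alignment objective relies on.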
