Masked visual modeling (MVM) has recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies have failed to find an MVM strategy that substantially benefits downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), in which the supervision from MVM training can be backpropagated to the video pixel space. In total, we explore eight different reconstructive targets for MVM, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors that lead to effective MVM training, resulting in an enhanced model, VIOLETv2. Empirically, we show that VIOLETv2 pre-trained with the MVM objective achieves notable improvements on 13 VidL benchmarks, spanning video question answering, video captioning, and text-to-video retrieval.
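To make the masked-reconstruction idea concrete, the following is a minimal sketch of an MVM loss for the simplest target named above, raw pixel values. It is illustrative only and is not taken from the VIOLET code: the function names `patchify` and `masked_pixel_loss`, the 8x8 patch size, the 15% mask ratio, and the choice of an L1 regression loss are all assumptions, and a zero array stands in for the model's predictions.

```python
import numpy as np


def patchify(frames, patch):
    """Split (T, H, W, C) frames into (T, N, patch*patch*C) flattened patches."""
    t, h, w, c = frames.shape
    assert h % patch == 0 and w % patch == 0, "frame size must divide by patch"
    gh, gw = h // patch, w // patch
    x = frames.reshape(t, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # (T, gh, gw, patch, patch, C)
    return x.reshape(t, gh * gw, patch * patch * c)


def masked_pixel_loss(pred, target, mask):
    """Mean absolute error computed over masked patches only.

    pred, target: (T, N, D) arrays; mask: (T, N) boolean, True = masked.
    Returns 0.0 when nothing is masked.
    """
    if not mask.any():
        return 0.0
    per_patch = np.abs(pred - target).mean(axis=-1)  # (T, N)
    return float(per_patch[mask].mean())


rng = np.random.default_rng(0)
frames = rng.random((4, 32, 32, 3))          # 4 frames of 32x32 RGB (toy data)
target = patchify(frames, patch=8)           # (4, 16, 192) pixel targets
mask = rng.random(target.shape[:2]) < 0.15   # assumed 15% mask ratio
pred = np.zeros_like(target)                 # stand-in for model predictions
loss = masked_pixel_loss(pred, target, mask)
```

Only masked patches contribute to the loss, so the model is scored on content it could not see. Under the same assumptions, swapping `target` for another per-patch quantity (for example oriented-gradient histograms or features from a separate encoder) would change the reconstructive target without changing the loss structure.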