Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

Scaling up weakly-supervised datasets has been shown to be highly effective in the image-text domain and has contributed to most of the recent state-of-the-art computer vision and multimodal neural networks. However, existing large-scale video-text datasets and mining techniques suffer from several limitations, such as the scarcity of aligned data, the lack of diversity in the data, and the difficulty of collecting aligned data. The currently popular approach to video-text data mining, automatic speech recognition (ASR) as used in HowTo100M, provides low-quality captions that often do not refer to the video content. Other mining approaches either do not provide proper language descriptions (video tags) or are biased toward short clips (alt text). In this work, we show how recent advances in image captioning allow us to pre-train high-quality video models without any parallel video-text data. We pre-train several video captioning models that are based on an OPT language model and a TimeSformer visual backbone, and we fine-tune these networks on several video captioning datasets. First, we demonstrate that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions. Second, we show that pre-training on both images and videos produces a significantly better network (+4 CIDEr on MSR-VTT) than pre-training on a single modality. Our methods are complementary to existing pre-training and data mining approaches and can be used in a variety of settings. Given the efficacy of the pseudolabeling method, we plan to publicly release the generated captions.
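The pseudolabeling idea can be illustrated with a minimal sketch. The abstract does not say how frames are selected or which image captioner is used, so everything below is an assumption made for illustration: the helper names `sample_frame_indices` and `pseudolabel_clip` are hypothetical, frames are taken at evenly spaced segment midpoints, and `caption_image` stands in for any image captioning model.

```python
from typing import Callable, List, Sequence, Tuple


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices at the midpoints of equal-length segments.

    Assumption: uniform sampling; the abstract does not specify a strategy.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def pseudolabel_clip(
    frames: Sequence,
    caption_image: Callable[[object], str],
    num_samples: int = 1,
) -> List[Tuple[int, str]]:
    """Caption sampled frames and return (frame_index, caption) pairs.

    `caption_image` is a placeholder for an image captioning model; the
    resulting captions act as pseudolabels for the whole clip.
    """
    indices = sample_frame_indices(len(frames), num_samples)
    return [(i, caption_image(frames[i])) for i in indices]


if __name__ == "__main__":
    # Toy stand-ins: strings instead of image tensors, a lambda instead of a model.
    toy_frames = ["frame_a", "frame_b", "frame_c", "frame_d"]
    print(pseudolabel_clip(toy_frames, lambda f: f"caption of {f}", num_samples=2))
```

In practice the lambda would be replaced by a real captioner, and the resulting (clip, caption) pairs would serve as pre-training data in place of ASR transcripts.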
