Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Multi-channel video-language retrieval requires models to understand information from different channels (e.g., video+question, video+speech) in order to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models such as CLIP have been shown to be highly effective at aligning entities in images/videos and text, and text contrastive models such as SimCSE have recently been studied extensively for their strong ability to produce discriminative sentence embeddings. These abilities are exactly what multi-channel video-language retrieval needs. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate representing videos as either continuous feature vectors or discrete text tokens; for the fusion method, we explore either a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four resulting combinations on five video-language datasets. Surprisingly, we find that discrete text tokens coupled with a pretrained contrastive text model yields the best performance, and can even outperform the state of the art on the iVQA and How2QA datasets without additional training on millions of video-language samples. Further analysis shows that this is because representing videos as text tokens captures the key visual information in a form that is naturally aligned with text models, and because text models become strong multimodal retrievers after the contrastive pretraining process.
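To make the best-performing combination concrete (video represented as discrete text tokens, fused with the other channel by a text encoder), the following is a minimal sketch and not the authors' implementation. It assumes the video has already been converted into a list of text tokens by some upstream visual model, and it replaces the pretrained contrastive text model with a toy bag-of-words embedding so that the example runs without external dependencies. The names `toy_embed`, `build_query`, and `retrieve` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy stand-in for a pretrained contrastive text encoder (ASSUMPTION).

    Returns an L2-normalized bag-of-words vector as a dict word -> weight.
    A real system would use a sentence encoder such as SimCSE here.
    """
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def build_query(video_tokens, channel_text):
    """Fuse the channels by plain text concatenation:
    video tokens first, then the question or speech text."""
    return " ".join(video_tokens) + " " + channel_text


def retrieve(video_tokens, channel_text, candidates):
    """Rank candidate text responses against the fused query.

    Returns the index of the best candidate and all similarity scores.
    """
    query_vec = toy_embed(build_query(video_tokens, channel_text))
    scores = [cosine(query_vec, toy_embed(c)) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, scores


# Hypothetical example: tokens assumed to describe a cooking video.
video_tokens = ["pan", "onion", "knife", "chopping"]
question = "what is being chopped"
candidates = ["onion", "guitar", "car"]

best, scores = retrieve(video_tokens, question, candidates)
print(candidates[best], scores)
```

Because both the video and the question end up as text in this setup, a single text encoder can score the fused query against every candidate answer, and no separate multimodal fusion module is needed. With a real contrastive sentence encoder in place of `toy_embed`, the ranking would rely on semantic similarity rather than exact word overlap.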
