Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that uses Contrastive Language-Image Pre-training (CLIP), trained on a large number of general-domain language-image pairs, to guide cross-modal learning for VideoQA. Specifically, we first extract video features with TimeSformer and text features with BERT from the target application domain, and use CLIP to extract a pair of visual-text features from the general-knowledge domain through domain-specific learning. We then propose Cross-domain Learning to extract the attention information between visual and linguistic features across the target domain and the general domain. The resulting set of CLIP-guided visual-text features is integrated to predict the answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets, where it outperforms state-of-the-art methods.
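The cross-domain attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention form, so scaled dot-product attention is assumed here, and the pairing (target-domain video features attending to CLIP text features, target-domain text features attending to CLIP visual features), the mean pooling, and the names `cross_attention`, `mean_pool`, and `fuse` are all hypothetical. Plain Python lists stand in for the TimeSformer, BERT, and CLIP outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value rows, one output column at a time.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def mean_pool(rows):
    """Average a list of feature rows into a single feature vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fuse(video_feats, text_feats, clip_visual, clip_text):
    """Assumed fusion: target-domain features query the general-domain CLIP features.

    video_feats / text_feats stand in for TimeSformer / BERT outputs;
    clip_visual / clip_text stand in for CLIP outputs. The pairing and
    the concatenation below are illustrative assumptions.
    """
    guided_video = cross_attention(video_feats, clip_text, clip_text)
    guided_text = cross_attention(text_feats, clip_visual, clip_visual)
    # Concatenate the two pooled, CLIP-guided vectors into one representation.
    return mean_pool(guided_video) + mean_pool(guided_text)


# Toy inputs: 2 video tokens, 1 text token, 2 CLIP tokens per modality.
video = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
clip_v = [[1.0, 2.0], [3.0, 4.0]]
clip_t = [[2.0, 2.0], [2.0, 2.0]]
fused = fuse(video, text, clip_v, clip_t)
```

In a full model, the fused vector would feed an answer classifier; that stage is omitted because the abstract gives no detail about it.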