In text-video retrieval, recent works benefit from the strong learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for these approaches is how to effectively capture the rich semantics of a video using the CLIP image encoder. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse text information into video frame representations. This, however, incurs severe efficiency issues in large-scale retrieval systems, because the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate global video semantics into the frame representations. We then apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performance on three benchmark datasets: MSR-VTT, MSVD, and LSMDC.
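The efficiency argument above (video representations computed offline, then reused for every text query) together with mean-pooling as the temporal fusion can be illustrated with a small sketch. This is not the paper's implementation: the function names (`mean_pool`, `build_index`, `rank`) are hypothetical, the three-dimensional vectors are made-up toy values standing in for encoder outputs, and cosine similarity is assumed as the matching score, as is common for CLIP-style models.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Naive temporal fusion: average the per-frame embeddings of one video."""
    if not frames:
        raise ValueError("need at least one frame embedding")
    dim = len(frames[0])
    if any(len(f) != dim for f in frames):
        raise ValueError("all frame embeddings must have the same dimension")
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def build_index(videos: Dict[str, Sequence[Sequence[float]]]) -> Dict[str, Vector]:
    """Offline step: one text-independent vector per video, computed once."""
    return {vid: l2_normalize(mean_pool(frames)) for vid, frames in videos.items()}


def rank(text_embedding: Sequence[float], index: Dict[str, Vector]) -> List[str]:
    """Online step: score a text query against the precomputed video vectors."""
    q = l2_normalize(text_embedding)
    scores = {vid: sum(a * b for a, b in zip(q, v)) for vid, v in index.items()}
    return sorted(scores, key=lambda vid: scores[vid], reverse=True)


# Toy frame embeddings (assumed values, two frames per video).
videos = {
    "cooking": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]],
    "soccer": [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]],
}
index = build_index(videos)  # computed once, independent of any text

# The same index serves different text queries without re-encoding the videos.
print(rank([1.0, 0.0, 0.0], index))
print(rank([0.0, 0.0, 1.0], index))
```

In this setup the per-query cost is one text encoding plus a dot product per indexed video. A method that fuses the text into the frame representations would instead have to rerun the video-side computation for every query.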