For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods dominate. In contrast to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity by fine-grained cross-modal feature interaction and matching, which puts their scalability for large-scale T2VR in doubt. For efficient T2VR, we propose TeachCLIP, which uses multi-grained teaching to let a CLIP4Clip-based student network learn from more advanced yet computationally heavy models such as X-CLIP, TS2-Net and X-Pool. To improve the student's learning capability, we add an Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage or computation overhead at the retrieval stage. While the attentive weights produced by AFA are commonly used for combining frame-level features, we propose a novel use of these weights: letting them imitate the frame-text relevance estimated by the teacher network. As such, AFA provides a fine-grained learning (teaching) channel for the student (teacher). Extensive experiments on multiple public datasets support the viability of the proposed method.
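The role of the AFA weights described above can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration and is not taken from the paper: the single scoring vector `w` standing in for the learned attention, the function names `afa_aggregate` and `frame_teaching_loss`, the cross-entropy form of the loss, and the toy numbers. It shows only the general idea: the weights that pool frame features into one video feature are also trained to match a teacher's frame-text relevance distribution.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def afa_aggregate(frame_feats, w):
    """Hypothetical AFA stand-in: score each frame with vector w,
    softmax the scores, and return the weighted sum of frame features
    together with the attention weights."""
    weights = softmax([dot(f, w) for f in frame_feats])
    dim = len(frame_feats[0])
    video_feat = [
        sum(a * f[d] for a, f in zip(weights, frame_feats)) for d in range(dim)
    ]
    return video_feat, weights


def frame_teaching_loss(student_weights, teacher_frame_text_sims, temperature=1.0):
    """Assumed loss form: cross-entropy between the teacher's softmaxed
    frame-text similarities (target) and the student's AFA weights."""
    target = softmax(teacher_frame_text_sims, temperature)
    return -sum(t * math.log(s + 1e-12) for t, s in zip(target, student_weights))


# Toy data: three frames with 2-d features (illustrative values only).
frame_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = [2.0, 0.0]  # hypothetical learned scoring vector

video_feat, weights = afa_aggregate(frame_feats, w)
# Teacher similarities would come from a heavier model such as X-CLIP;
# here they are made-up numbers.
loss = frame_teaching_loss(weights, [0.9, 0.1, 0.5])
```

In this sketch only `video_feat` would need to be stored per video, so matching against a text feature stays a single dot product, which is in line with the claim that AFA adds no overhead at the retrieval stage. The teacher similarities are needed only during training.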