We describe a protocol for studying text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to images labeled with text. Using image expert models is a realistic scenario, given that annotating images is cheaper and therefore more scalable than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we build on this progress and instantiate the image experts with two types of models: a text-to-image retrieval model that provides an initial backbone, and image captioning models that provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from the multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights, and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baseline on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD.
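To make the pooling step concrete, below is a minimal sketch of caption-scored temporal pooling as we read it from the description above: frames are weighted by their similarity to a caption embedding and aggregated into a single video representation. The function name, the softmax weighting, and the temperature value are our own illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch (not the authors' released code): caption-scored temporal pooling.
# Assumes per-frame embeddings and a caption embedding from a CLIP-like dual encoder.
import torch
import torch.nn.functional as F

def caption_scored_pooling(frame_emb: torch.Tensor,
                           caption_emb: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """frame_emb: (num_frames, dim); caption_emb: (dim,). Returns a (dim,) video embedding."""
    frame_emb = F.normalize(frame_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    # Relevance of each frame to the caption (cosine similarity).
    scores = frame_emb @ caption_emb                      # (num_frames,)
    weights = torch.softmax(scores / temperature, dim=0)  # (num_frames,)
    # Weighted temporal pooling over frame representations.
    video_emb = (weights.unsqueeze(-1) * frame_emb).sum(dim=0)
    return F.normalize(video_emb, dim=-1)

# Example: 8 frames with 512-d CLIP features and one pseudo-caption embedding.
video_emb = caption_scored_pooling(torch.randn(8, 512), torch.randn(512))
```

In this reading, the resulting video embedding can be contrasted against caption embeddings with a standard retrieval loss; the temperature simply controls how sharply the pooling focuses on the most caption-relevant frames.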