Ad-hoc Video Search (AVS) enables users to search for unlabeled video content using on-the-fly textual queries. Current deep learning-based models for AVS are trained to optimize holistic similarity between short videos and their associated descriptions. However, due to the diversity of ad-hoc queries, the part of a video that is truly relevant to a given query can be shorter than the video itself, even when the video is short. In such a scenario, holistic similarity becomes suboptimal. To remedy the issue, we propose CLIPRerank, a fine-grained re-scoring method. We compute cross-modal similarities between the query and individual video frames using a pre-trained CLIP model, and aggregate the multi-frame scores by max pooling. The resulting fine-grained score is combined with the initial score through a weighted sum, and the combined score is used to rerank the search results. As such, CLIPRerank is agnostic to the underlying video retrieval model and extremely simple, making it a handy plug-in for boosting AVS. Experiments on the challenging TRECVID AVS benchmarks (from 2016 to 2021) demonstrate the effectiveness of the proposed strategy. CLIPRerank consistently improves the TRECVID top performers and multiple existing models, including SEA, W2VV++, Dual Encoding, Dual Task, LAFF, CLIP2Video, TS2-Net and X-CLIP. Our method also works when substituting BLIP-2 for CLIP.
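The re-scoring step described in the abstract can be illustrated with a minimal sketch. It assumes that the query and frame embeddings have already been produced by a CLIP-style encoder and are passed in as plain arrays; the function names, the cosine similarity, the exact form of the weighted sum (`initial + weight * fine`) and the default weight are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def fine_grained_score(query_emb, frame_embs):
    """Max-pooled cosine similarity between one query and the frames of one video.

    query_emb:  shape (d,), hypothetical text embedding of the query
    frame_embs: shape (n_frames, d), hypothetical image embeddings of the frames
    """
    q = l2_normalize(np.asarray(query_emb, dtype=float))
    f = l2_normalize(np.asarray(frame_embs, dtype=float))
    per_frame = f @ q  # one similarity per frame
    return float(per_frame.max())  # max pooling over frames


def rerank(initial_scores, query_emb, frames_per_video, weight=0.5):
    """Add the weighted fine-grained score to each initial score and re-sort.

    The combination rule and the value of `weight` are assumptions made
    for this sketch; the abstract only states that a weighted sum is used.
    Returns the video indices in descending order of combined score,
    together with the combined scores.
    """
    fine = np.array(
        [fine_grained_score(query_emb, frames) for frames in frames_per_video]
    )
    combined = np.asarray(initial_scores, dtype=float) + weight * fine
    order = np.argsort(-combined, kind="stable")
    return order, combined


# Toy example: video 0 has one frame matching the query, video 1 has none.
query = np.array([1.0, 0.0])
videos = [
    np.array([[0.0, 1.0], [1.0, 0.0]]),
    np.array([[0.0, 1.0], [0.0, 1.0]]),
]
order, combined = rerank([0.2, 0.5], query, videos, weight=0.5)
```

In the toy example, video 1 has the higher initial score, but a single matching frame in video 0 is enough to move it to the top after reranking, which is the effect max pooling is meant to capture. In practice the initial and fine-grained scores may live on different scales, so some normalization before the sum is likely needed.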