Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of a video challenges both the performance and the efficiency of TVR. To handle the sequential video context, existing methods typically select a subset of frames within a video to represent its content. How to select the most representative frames is a crucial issue: the selected frames must retain the semantic information of the video while also improving retrieval efficiency by excluding temporally redundant frames. In this paper, we present the first empirical study of frame selection for TVR. We systematically classify existing frame selection methods into text-free and text-guided ones, and within this taxonomy we analyze six frame selection methods in detail in terms of effectiveness and efficiency. Two of these methods are newly developed in this paper. Based on a comprehensive analysis on multiple TVR benchmarks, we empirically conclude that TVR with proper frame selection can significantly improve retrieval efficiency without sacrificing retrieval performance.
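To make the distinction between text-free and text-guided selection concrete, the following is a minimal sketch of one plausible instance of each category. It is illustrative only and is not one of the six methods analyzed in the paper: the function names (`uniform_select`, `text_guided_select`) are hypothetical, and the frame and text features are assumed to be precomputed embeddings in a shared space.

```python
import math


def uniform_select(num_frames, k):
    """Text-free (assumed example): pick k frame indices evenly spaced in time."""
    if k >= num_frames:
        return list(range(num_frames))
    step = num_frames / k
    # Take the centre of each of k equal-length temporal segments.
    return [int(step * i + step / 2) for i in range(k)]


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def text_guided_select(frame_feats, text_feat, k):
    """Text-guided (assumed example): keep the k frames most similar to the query."""
    scores = [cosine(f, text_feat) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Return the chosen indices in temporal order.
    return sorted(top)
```

The text-free variant depends only on the video length, so its selection can be computed once per video offline; the text-guided variant must score frames against each query, which is where the trade-off between effectiveness and efficiency arises.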