In this paper, we propose an efficient, high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one moment relevant to the input text query. In terms of both efficiency and performance, a bottleneck overlooked by previous studies is the visual encoding of dense frames. This cost leads researchers to choose lightweight visual backbones, which yield sub-optimal retrieval performance because of the limited capability of their learned visual representations. However, simply replacing them with high-performance large-scale vision-and-language models (VLMs) is undesirable because of their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ of the original and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images and demonstrate promising zero-shot performance against state-of-the-art (SOTA) methods while remaining efficient. In addition, we propose a fine-tuning approach that incorporates a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
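The super-image construction described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `make_super_images`, the row-major frame ordering, and the zero-padding of an incomplete final grid are all assumptions made for illustration, since the abstract only states that frames are rearranged in an $N \times N$ grid.

```python
import numpy as np


def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile frames of shape (T, H, W, C) into super images of shape (M, n*H, n*W, C).

    Hypothetical helper: frame order (row-major) and padding (black frames)
    are assumptions, not details taken from the paper.
    """
    t, h, w, c = frames.shape
    per_image = n * n
    num_images = -(-t // per_image)  # ceiling division
    pad = num_images * per_image - t
    if pad:
        # Assumption: fill the unused cells of the last grid with zeros.
        filler = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, filler], axis=0)
    # (M, grid_row, grid_col, H, W, C)
    grid = frames.reshape(num_images, n, n, h, w, c)
    # Interleave grid rows with pixel rows and grid columns with pixel columns.
    grid = grid.transpose(0, 1, 3, 2, 4, 5)
    return grid.reshape(num_images, n * h, n * w, c)
```

Under these assumptions, a video of $T$ sampled frames needs $\lceil T / N^2 \rceil$ encoder forward passes instead of $T$, which is the source of the $\frac{1}{N^2}$ reduction mentioned in the abstract.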