Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval

In this paper, we propose an efficient and high-performance method for partially relevant video retrieval, which aims to retrieve long videos that contain at least one moment relevant to the input text query. The challenge lies in encoding dense frames with visual backbones: the number of frames to encode grows with video length, resulting in significant computation costs for long videos. To mitigate these costs, previous studies use lightweight visual backbones, which yields sub-optimal retrieval performance due to their limited capabilities. However, simply replacing these backbones with high-performance large vision-and-language models (VLMs) is undesirable because of their low efficiency. To address this dilemma, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ of the original and mitigates the low efficiency of large VLMs. Based on this idea, we make two contributions. First, we explore whether VLMs generalize to super images in a zero-shot setting. To this end, we propose a method called query-attentive super image retrieval (QASIR), which attends to partial moments relevant to the input query. The zero-shot QASIR yields two discoveries: (1) it enables VLMs to generalize to super images, and (2) the grid size $N$, image resolution, and VLM size are key trade-off parameters between performance and computation costs. Second, we introduce fine-tuning QASIR and hybrid QASIR, the latter of which combines high- and low-efficiency models to strike a balance between performance and computation costs. This reveals two findings: (1) fine-tuning QASIR enables VLMs to learn super images effectively, and (2) hybrid QASIR minimizes the performance drop of large VLMs while reducing computation costs.
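
As an illustration of the super image idea, the following is a minimal sketch of how $T$ video frames could be tiled into $N \times N$ grids, so that only $\lceil T / N^2 \rceil$ images need to be encoded. The function name `make_super_images`, the row-major placement of frames, and the zero-padding of the last incomplete grid are assumptions made for this example; they are not details stated in the abstract.

```python
import numpy as np


def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Tile frames of shape (T, H, W, C) into n x n grids.

    Returns an array of shape (S, n*H, n*W, C) with S = ceil(T / n^2).
    Frames are placed in row-major order (an assumption of this sketch).
    """
    t, h, w, c = frames.shape
    per_image = n * n
    num_super = -(-t // per_image)  # ceiling division
    pad = num_super * per_image - t
    if pad:
        # Assumption: fill the unused cells of the last grid with black frames.
        filler = np.zeros((pad, h, w, c), dtype=frames.dtype)
        frames = np.concatenate([frames, filler], axis=0)
    # (S, grid_row, grid_col, H, W, C)
    grid = frames.reshape(num_super, n, n, h, w, c)
    # Interleave grid rows with pixel rows and grid columns with pixel columns.
    grid = grid.transpose(0, 1, 3, 2, 4, 5)
    return grid.reshape(num_super, n * h, n * w, c)


# Example: 10 tiny frames, each filled with its own index, tiled with N = 2.
frames = np.stack([np.full((2, 3, 1), i, dtype=np.uint8) for i in range(10)])
super_images = make_super_images(frames, n=2)
print(super_images.shape)  # 10 frames become 3 super images of size 4 x 6
```

With $N = 2$, the ten frames above are packed into three super images, so the visual backbone is called three times instead of ten; larger $N$ reduces the number of encodings further at the cost of less resolution per frame.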
