Composed Video Retrieval (CoVR) facilitates video retrieval by combining visual and textual queries. However, existing CoVR frameworks typically fuse the multimodal inputs in a single stage, achieving only marginal gains over the initial baseline. To address this, we propose a novel CoVR framework that leverages the representational power of Vision-Language Models (VLMs). Our framework incorporates a novel module, X-Aligner, composed of cross-attention layers that progressively fuse the visual and textual inputs and align their multimodal representation with that of the target video. To further enrich the multimodal query representation, we incorporate the caption of the visual query as an additional input. The framework is trained in two stages to preserve the pretrained VLM representation: in the first stage, only the newly introduced module is trained, while in the second stage, the textual query encoder is also fine-tuned. We implement our framework on top of the BLIP-family architectures BLIP and BLIP-2 and train it on the WebVid-CoVR dataset. In addition to in-domain evaluation on WebVid-CoVR-Test, we perform zero-shot evaluations on the Composed Image Retrieval (CIR) datasets CIRCO and Fashion-IQ. Our framework achieves state-of-the-art performance on CoVR, obtaining a Recall@1 of 63.93% on WebVid-CoVR-Test, and demonstrates strong zero-shot generalization on CIR tasks.
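To make the fusion idea concrete, below is a minimal PyTorch sketch of a progressive cross-attention stack of the kind X-Aligner describes. The class names (XAlignerBlock, XAligner), depth, dimensions, and norm placement are illustrative assumptions, not the paper's exact implementation; text tokens (modification text plus the visual-query caption) act as queries attending to visual tokens, layer by layer.

```python
import torch
import torch.nn as nn

class XAlignerBlock(nn.Module):
    """One cross-attention fusion layer (hypothetical sketch, not the paper's exact layout)."""
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # Text tokens query the visual tokens; residual + norm as in standard transformer blocks.
        attn_out, _ = self.cross_attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        x = self.norm1(text_tokens + attn_out)
        return self.norm2(x + self.ffn(x))

class XAligner(nn.Module):
    """Stack of cross-attention blocks that progressively fuses the multimodal query."""
    def __init__(self, dim: int = 768, num_heads: int = 12, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([XAlignerBlock(dim, num_heads) for _ in range(depth)])
        self.proj = nn.Linear(dim, dim)  # projects the fused [CLS] token into the retrieval space

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        x = text_tokens
        for blk in self.blocks:
            x = blk(x, visual_tokens)
        # Fused query embedding, to be matched against target-video embeddings.
        return self.proj(x[:, 0])

# Example shapes (assumed): batch of 2 queries, 32 text tokens, 64 visual tokens, dim 768.
aligner = XAligner()
fused = aligner(torch.randn(2, 32, 768), torch.randn(2, 64, 768))  # -> (2, 768)
```

Under the two-stage scheme described above, stage one would train only the parameters of this module while the pretrained VLM encoders stay frozen (e.g. setting requires_grad = False on them), and stage two would additionally unfreeze the textual query encoder.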