Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work, characterized by two-stream Vision-Language architectures that learn a joint representation of video-text pairs, has become the prominent approach to the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario in which video content usually encompasses multiple events, while texts such as user queries or webpage metadata tend to be specific and correspond to single events. This creates a gap between the previous training objective and real-world applications, leading to potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, a specialized variant of the conventional Video-Text Retrieval task that addresses scenarios in which each video contains multiple different events. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models on both the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.