Multi-event Video-Text Retrieval

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A prominent line of work for VTR uses a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs. However, these models assume bijective video-text correspondences and neglect a more practical scenario in which a video usually encompasses multiple events, while texts such as user queries or webpage metadata tend to be specific and correspond to single events. This creates a gap between the earlier training objective and real-world applications, which can degrade the performance of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, which addresses scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR.

