Localizing events in videos based on semantic queries is a pivotal task in video understanding, one of growing importance given user-oriented applications such as video search. Yet current research predominantly relies on natural language queries (NLQs) and overlooks the potential of multimodal queries (MQs), which integrate images to represent semantic queries more flexibly, especially when non-verbal or unfamiliar concepts are difficult to express in words. To bridge this gap, we introduce ICQ, a new benchmark for localizing events in videos with MQs, together with an evaluation dataset, ICQ-Highlight. To adapt and evaluate existing video localization models on this new task, we propose three Multimodal Query Adaptation methods and a novel Surrogate Fine-tuning strategy on pseudo-MQs. ICQ systematically benchmarks 12 state-of-the-art backbone models, ranging from specialized video localization models to Video LLMs, across diverse application domains. Our experiments highlight the strong potential of MQs in real-world applications. We believe this benchmark is a first step toward advancing MQs in video event localization.