Video understanding is a pivotal task in the digital era, yet the dynamic, multi-event nature of videos makes them labor-intensive and computationally demanding to process. Consequently, localizing a specific event given a semantic query has gained importance both in user-oriented applications such as video search and in academic research on video foundation models. A significant limitation of current research is that semantic queries are typically expressed in natural language alone, describing the semantics of the target event. This setting overlooks the potential of multimodal semantic queries composed of images and text. To address this gap, we introduce ICQ, a new benchmark for localizing events in videos with multimodal queries, along with a new evaluation dataset, ICQ-Highlight. Our benchmark evaluates how well models can localize an event given a multimodal semantic query consisting of a reference image, which depicts the event, and a refinement text that adjusts the image's semantics. To benchmark model performance systematically, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to this new setting and evaluate 10 state-of-the-art models, ranging from specialized models to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.