Temporally localizing user-queried events through natural language is a crucial capability for video models. Recent methods predominantly adapt video LLMs to generate event boundary timestamps for temporal localization, an approach that struggles to leverage LLMs' pre-trained semantic understanding because timestamp outputs carry little semantic information. In this work, we explore a timestamp-free, semantic-oriented framework that fine-tunes video LLMs with two generative learning tasks and one discriminative learning task. We first introduce a structural token generation task that enables the video LLM to recognize the temporal structure of the input video with respect to the input query. Through this task, the video LLM generates a sequence of special tokens, called structural tokens, which partition the video into consecutive segments and categorize each segment as either a target event or a background transition. To enable precise recognition of event segments, we further propose a query-focused captioning task that lets the video LLM extract fine-grained event semantics, which the structural tokens can then effectively exploit. Finally, we introduce a structural token grounding module, driven by contrastive learning, that associates each structural token with its corresponding video segment, achieving holistic temporal segmentation of the input video and readily yielding the target event segments for localization. Extensive experiments across diverse temporal localization tasks demonstrate that our proposed framework, MeCo, consistently outperforms methods that rely on boundary timestamp generation, highlighting the potential of a semantic-driven approach to temporal localization with video LLMs.\footnote{Code available at https://github.com/pangzss/MeCo.}