Video temporal grounding (VTG) aims to localize the start and end timestamps of the event described by a given query within an untrimmed video. Despite the strong open-world video understanding and recognition ability of video language large models (Vid-LLMs), outputting precise temporal grounding information remains challenging, since explicit temporal cues are scarce in untrimmed videos, and query-relevant entities are hard to track consistently across the video timeline. In this paper, we present \MarkIt{}, a training-free framework that transforms an input video into a query-conditioned marked video, which empowers Vid-LLMs to generate more reliable temporal localization predictions. The core component of \MarkIt{} is an annotation-free query-to-mask grounding bridge (Q2M-Bridge). Given a natural-language query, it automatically derives a compact set of canonical subject tags through linguistic parsing and normalization, then maps these tags to query-conditioned instance masks using text-conditioned open-vocabulary segmentation. The bridge also embeds lightweight semantic instance markers and a persistent frame index into each frame, effectively transforming long-range temporal reasoning into explicit visual cues for Vid-LLMs. \MarkIt{} adopts an inference-time plug-and-play design, needs no modifications to Vid-LLM weights, and is fully compatible with supervised fine-tuning. Experiments conducted on multiple mainstream moment retrieval and highlight detection benchmarks demonstrate that \MarkIt {} achieves state-of-the-art results, delivering consistent temporal grounding improvements across a wide range of existing models.
翻译:视频时序定位(VTG)旨在无修剪视频中定位给定查询所描述事件的起始与结束时间戳。尽管视频语言大模型(Vid-LLMs)具备强大的开集视频理解与识别能力,但由于无修剪视频中显式时序线索稀缺,且查询相关实体难以在视频时间线上持续追踪,输出精确的时序定位信息仍具挑战。本文提出无需训练框架MarkIt,通过将输入视频转化为查询条件化的标记视频,赋能Vid-LLMs生成更可靠的时序定位预测。其核心组件为免标注的查询-掩码时序定位桥接模块(Q2M-Bridge)。该模块通过语言解析与规范化自动从自然语言查询中提取精简的标准主体标签集,进而利用文本条件化的开集分割将标签映射为查询条件化的实例掩码。桥接模块还向每帧嵌入轻量级语义实例标记与持续帧索引,有效将长程时序推理转化为Vid-LLMs的显式视觉线索。MarkIt采用推理时即插即用设计,无需修改Vid-LLM权重,且完全兼容监督微调。在多项主流时刻检索与高光检测基准上的实验表明,MarkIt实现了最先进性能,并在多种现有模型中取得一致的时序定位提升。