Video temporal grounding (VTG) is typically tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts to overcome this limitation have adapted large multimodal language models (MLLMs) to VTG, but their high compute cost and limited video context still hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing the negative transfer observed under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks (GoalStep-StepGrounding, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions), one UniversalVTG checkpoint achieves state-of-the-art performance versus dedicated VTG models. Moreover, despite being $>100\times$ smaller than recent MLLM-based approaches, UniversalVTG matches or exceeds their accuracy on multiple benchmarks, offering a practical alternative to parameter-heavy MLLMs.