Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their ability to generalize across VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions. First, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Second, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Finally, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The code is available at https://github.com/showlab/UniVTG.
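To make the idea of a unified label formulation more concrete, the following minimal sketch shows one way interval-style labels (as used in moment retrieval) could be converted into per-clip labels that also carry a saliency value (as used in highlight detection). The abstract does not specify the representation, so everything here is an illustrative assumption: the names `ClipLabel` and `intervals_to_clip_labels`, the fixed clip length, and the choice of fields are hypothetical and are not taken from the UniVTG codebase.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ClipLabel:
    """Hypothetical per-clip label; the fields are assumptions for illustration."""
    center: float                  # clip center timestamp, in seconds
    foreground: bool               # True if the clip lies inside a target interval
    offsets: Tuple[float, float]   # (center - start, end - center) of that interval
    saliency: float                # query relevance, here simply 0.0 or 1.0


def intervals_to_clip_labels(
    duration: float,
    clip_len: float,
    intervals: List[Tuple[float, float]],
) -> List[ClipLabel]:
    """Split a video into fixed-length clips and label each clip from intervals."""
    labels: List[ClipLabel] = []
    n_clips = int(duration // clip_len)
    for i in range(n_clips):
        center = (i + 0.5) * clip_len
        # Find the first target interval that contains the clip center, if any.
        hit: Optional[Tuple[float, float]] = next(
            ((s, e) for s, e in intervals if s <= center <= e), None
        )
        if hit is None:
            # Background clip: no boundary offsets and zero saliency.
            labels.append(ClipLabel(center, False, (0.0, 0.0), 0.0))
        else:
            start, end = hit
            labels.append(
                ClipLabel(center, True, (center - start, end - center), 1.0)
            )
    return labels


if __name__ == "__main__":
    # A 10-second video split into 2-second clips, with one target interval.
    for label in intervals_to_clip_labels(10.0, 2.0, [(2.0, 6.0)]):
        print(label)
```

Under this assumed representation, a single time interval and a disjoint set of shots both reduce to the same list of per-clip records, which is the kind of property a shared model across tasks would need.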