We introduce ED-VTG, a method for fine-grained video temporal grounding using multimodal large language models. Our approach harnesses the ability of multimodal LLMs to jointly process text and video in order to localize natural language queries in videos through a two-stage process. In the first stage, rather than being grounded directly, language queries are transformed into enriched sentences that incorporate missing details and contextual cues to aid grounding. In the second stage, these enriched queries are grounded by a lightweight decoder that specializes in predicting accurate temporal boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in both temporal video grounding and paragraph grounding settings. Experiments show that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is superior or comparable to specialized models, while maintaining a clear advantage over them in zero-shot evaluation scenarios.
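To make the multiple-instance-learning objective described above concrete, the following PyTorch sketch supervises each sample only through the enriched query variant whose predicted boundaries incur the lowest loss. This is an illustrative assumption of how such dynamic selection could be implemented; the function and tensor names (mil_grounding_loss, pred_spans, gt_spans) are hypothetical and not taken from the paper.

```python
# Minimal sketch of a MIL-style selection over enriched query variants.
# Assumes a simple L1 boundary regression loss; the actual loss in the
# paper may differ.
import torch

def mil_grounding_loss(pred_spans: torch.Tensor, gt_spans: torch.Tensor) -> torch.Tensor:
    """pred_spans: (B, K, 2) predicted (start, end) per enriched query variant.
    gt_spans: (B, 2) ground-truth (start, end) per training sample.
    Returns the batch mean of the per-sample minimum loss, so each sample
    is supervised only through its best-matching query variant."""
    # L1 distance between each variant's prediction and the ground truth
    per_variant = (pred_spans - gt_spans.unsqueeze(1)).abs().sum(dim=-1)  # (B, K)
    # Dynamically select the variant with the lowest loss for each sample
    best_per_sample, _ = per_variant.min(dim=1)  # (B,)
    return best_per_sample.mean()

# Toy usage: batch of 4 samples, 3 enriched variants per query
pred = torch.rand(4, 3, 2, requires_grad=True)
gt = torch.rand(4, 2)
loss = mil_grounding_loss(pred, gt)
loss.backward()
```

Because gradients flow only through the selected variant for each sample, noisy or hallucinated enrichments that produce poor boundary predictions are effectively ignored during training.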