Video grounding is a fundamental problem in multimodal content understanding, aiming to localize the temporal segment of an untrimmed video that a given natural language query describes. However, current video grounding datasets focus only on simple events and are limited to either short videos or brief sentences, which hinders models from developing stronger multimodal understanding capabilities. To address these limitations, we present a large-scale video grounding dataset named SynopGround, in which more than 2800 hours of videos are sourced from popular TV dramas and paired with accurately localized human-written synopses. Each paragraph in a synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated with each other and contain both abstract expressions that summarize video storylines and specific descriptions that portray event details, which enables a model to learn multimodal perception of more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex video grounding setting dubbed Multi-Paragraph Video Grounding (MPVG), which takes multiple paragraphs and a long video as input and grounds each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) that explicitly models the local-global structures of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the effectiveness of the proposed model as well as its superiority over prior state-of-the-art methods in long-term multi-paragraph video grounding. Dataset and code are publicly available. Project page: https://synopground.github.io/.
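To make the MPVG input and output concrete, the following is a minimal sketch of how one sample and its predictions might be represented and scored. It is an illustration under stated assumptions, not the released SynopGround format or the LGMR implementation: the `MPVGSample` class, its field names, and the helper functions are hypothetical, and temporal IoU is used here only as a commonly used overlap measure in video grounding, since the abstract does not state which metrics are reported.

```python
from dataclasses import dataclass
from typing import List, Tuple

# An interval is (start_sec, end_sec) within the long video.
Interval = Tuple[float, float]


@dataclass
class MPVGSample:
    """Hypothetical container for one MPVG example (not the official schema)."""
    video_id: str
    duration: float            # video length in seconds
    paragraphs: List[str]      # synopsis paragraphs, each one a query
    intervals: List[Interval]  # one annotated interval per paragraph


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred: List[Interval], gt: List[Interval]) -> float:
    """Average temporal IoU over all paragraph queries of one video."""
    if len(pred) != len(gt):
        raise ValueError("need exactly one prediction per paragraph")
    if not gt:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(pred, gt)) / len(gt)


# Toy usage with made-up values (assumed for illustration only).
sample = MPVGSample(
    video_id="episode_0001",
    duration=2700.0,
    paragraphs=["First synopsis paragraph.", "Second synopsis paragraph."],
    intervals=[(0.0, 600.0), (600.0, 1500.0)],
)
predictions = [(0.0, 500.0), (650.0, 1500.0)]
score = mean_iou(predictions, sample.intervals)
```

In this sketch every paragraph receives exactly one predicted interval, mirroring the task description in which each paragraph query is grounded to its own temporal interval in the same long video.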