SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

Video grounding is a fundamental problem in multimodal content understanding, aiming to localize the segments of an untrimmed video that correspond to natural language queries. However, current video grounding datasets focus only on simple events and are limited to either short videos or brief sentences, which hinders models from evolving toward stronger multimodal understanding capabilities. To address these limitations, we present a large-scale video grounding dataset named SynopGround, in which more than 2800 hours of videos are sourced from popular TV dramas and are paired with accurately localized human-written synopses. Each paragraph in the synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated with one another and contain a wealth of abstract expressions summarizing video storylines as well as specific descriptions portraying event details, which enables models to learn multimodal perception of more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex video grounding setting dubbed Multi-Paragraph Video Grounding (MPVG), which takes multiple paragraphs and a long video as input and grounds each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) to explicitly model the local-global structures of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness as well as its superiority in long-term multi-paragraph video grounding over prior state-of-the-art methods. The dataset and code are publicly available. Project page: https://synopground.github.io/.
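
To make the MPVG setting concrete, the following is a minimal Python sketch of what one sample and its evaluation could look like. The class names, field names, and the use of temporal IoU as the scoring function are illustrative assumptions; they are not taken from the released SynopGround code or annotation format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


@dataclass
class ParagraphQuery:
    """One synopsis paragraph with its annotated interval (hypothetical schema)."""
    text: str
    start: float
    end: float


@dataclass
class MPVGSample:
    """One long video paired with all paragraphs of its synopsis (hypothetical schema)."""
    video_id: str
    duration: float
    paragraphs: List[ParagraphQuery]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # When the intervals overlap, the enclosing span equals their union;
    # when they do not, inter is 0 and the result is 0 regardless.
    span = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / span if span > 0 else 0.0


def mean_iou(sample: MPVGSample, preds: List[Interval]) -> float:
    """Average IoU over all paragraphs; one prediction per paragraph, in order."""
    if len(preds) != len(sample.paragraphs):
        raise ValueError("expected exactly one predicted interval per paragraph")
    scores = [temporal_iou(p, (q.start, q.end))
              for p, q in zip(preds, sample.paragraphs)]
    return sum(scores) / len(scores)


# Toy example with made-up values: a 40-minute episode and two paragraphs.
sample = MPVGSample(
    video_id="episode_001",
    duration=2400.0,
    paragraphs=[
        ParagraphQuery("The detective arrives at the scene.", 0.0, 600.0),
        ParagraphQuery("A suspect is questioned at the station.", 600.0, 1500.0),
    ],
)
predictions = [(0.0, 300.0), (600.0, 1500.0)]
score = mean_iou(sample, predictions)
```

The point of the sketch is that, unlike single-sentence grounding, a model in this setting receives all paragraphs of a synopsis jointly and must return one interval per paragraph for the same long video.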
