Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires comprehending complex storylines. This challenge arises from inherently long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these difficulties, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Although it can generate high-quality question-answer pairs (QAs) for TV series, its performance degrades significantly on longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework that generates high-quality DVU datasets for both TV series and movies. The framework integrates a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, and we use it to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/