StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel at factoid VideoQA, they struggle with deep video understanding (DVU), which requires comprehending complex storylines. The challenge arises from long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these difficulties, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Although it can generate high-quality question-answer pairs (QAs) for TV series, its performance degrades significantly on longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework that generates high-quality DVU datasets for both TV series and movies. StoryMindv2 integrates a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, and we use it to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/

