This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h of video), automatically generated dataset for long-form video understanding. We propose a systematic approach that organizes videos into a hierarchical tree structure for QA generation and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.). We also provide explicit timestamp annotations of the relevant events for each QA pair. Extensive human studies on LongViTU confirm the quality of the dataset. To better evaluate the challenges posed by LongViTU's emphasis on long-term context and condensed reasoning, we manually curate a subset of LongViTU into a benchmark. Evaluations using a state-of-the-art open-source model (LongVU), a proprietary model (Gemini-1.5-Pro), and human annotators yield GPT-4 scores of 49.9, 52.3, and 81.0, respectively, underscoring the substantial difficulty of LongViTU questions. Supervised fine-tuning (SFT) of LongVU and LLaVA-Video on LongViTU data yields average performance gains of 2.5% and 3.7%, respectively, across a suite of long video understanding benchmarks (EgoSchema, VideoMME-Long, MLVU, LVBench).