Current datasets for long-form video understanding often fall short of posing genuine long-form comprehension challenges: many of the tasks derived from them can be solved by analyzing just one or a few random frames of a video. To address this issue, we present CinePile, a novel dataset and benchmark designed specifically for authentic long-form video understanding. This paper details our approach to creating the question-answer dataset, which uses advanced LLMs with human-in-the-loop refinement and builds on human-generated raw data. The resulting dataset comprises 305,000 multiple-choice questions (MCQs) covering a range of visual and multimodal aspects, including temporal comprehension, understanding of human-object interactions, and reasoning about events or actions within a scene. We also fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The results show that although current models underperform relative to humans, fine-tuning them yields significant gains.
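To make the MCQ evaluation protocol concrete, below is a minimal sketch of how one might score a model on the test split. The dataset identifier, field names (`question`, `choices`, `answer_key_position`), and the `my_video_llm` call are illustrative assumptions for this sketch, not an API confirmed by the paper; consult the released dataset card for the actual schema.

```python
# Minimal sketch of MCQ accuracy scoring on a CinePile-style test split.
# Dataset name and field names below are assumptions, not a confirmed schema.
from datasets import load_dataset

LETTERS = "ABCDE"

def format_mcq_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its answer options as a lettered prompt."""
    options = "\n".join(f"{LETTERS[i]}) {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def accuracy(predictions: list[str], gold_letters: list[str]) -> float:
    """Fraction of predicted letters matching the answer key."""
    correct = sum(p.strip().upper()[:1] == g
                  for p, g in zip(predictions, gold_letters))
    return correct / len(gold_letters)

# Hypothetical usage (dataset id, fields, and model call are placeholders):
# ds = load_dataset("tomg-group-umd/cinepile", split="test")
# prompts = [format_mcq_prompt(ex["question"], ex["choices"]) for ex in ds]
# preds = [my_video_llm(p, ex["video"]) for p, ex in zip(prompts, ds)]
# gold = [LETTERS[ex["answer_key_position"]] for ex in ds]
# print(f"MCQ accuracy: {accuracy(preds, gold):.3f}")
```

Reporting plain letter-match accuracy keeps open-source and proprietary models directly comparable, since it requires nothing from the model beyond a single-letter answer.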