Current datasets for long-form video understanding often fall short of posing genuine long-form comprehension challenges: many of the tasks derived from them can be solved by analyzing just one or a few random frames of a video. To address this issue, we present CinePile, a novel dataset and benchmark designed specifically for authentic long-form video understanding. This paper details our approach to creating the question-answer dataset, which uses advanced LLMs with human-in-the-loop refinement and builds on human-generated raw data. The resulting dataset comprises 305,000 multiple-choice questions (MCQs) covering a range of visual and multimodal aspects, including temporal comprehension, understanding of human-object interactions, and reasoning about events or actions within a scene. We also fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset. The results show that although current models underperform relative to humans, fine-tuning them yields significant gains.
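To make the MCQ evaluation protocol concrete, below is a minimal sketch of how one might score a model on the test split. The dataset identifier, field names (`question`, `choices`, `answer_key_position`), and the `my_video_llm` call are illustrative assumptions for this sketch, not an API confirmed by the paper; consult the released dataset card for the actual schema.

```python
# Minimal sketch of MCQ accuracy scoring on a CinePile-style test split.
# Dataset name and field names below are assumptions, not a confirmed schema.
from datasets import load_dataset

LETTERS = "ABCDE"

def format_mcq_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its answer options as a lettered prompt."""
    options = "\n".join(f"{LETTERS[i]}) {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def accuracy(predictions: list[str], gold_letters: list[str]) -> float:
    """Fraction of predicted letters matching the answer key."""
    correct = sum(p.strip().upper()[:1] == g
                  for p, g in zip(predictions, gold_letters))
    return correct / len(gold_letters)

# Hypothetical usage (dataset id, fields, and model call are placeholders):
# ds = load_dataset("tomg-group-umd/cinepile", split="test")
# prompts = [format_mcq_prompt(ex["question"], ex["choices"]) for ex in ds]
# preds = [my_video_llm(p, ex["video"]) for p, ex in zip(prompts, ds)]
# gold = [LETTERS[ex["answer_key_position"]] for ex in ds]
# print(f"MCQ accuracy: {accuracy(preds, gold):.3f}")
```

Reporting plain letter-match accuracy keeps open-source and proprietary models directly comparable, since it requires nothing from the model beyond a single-letter answer.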