VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
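
The abstract mentions a semantic-aware step that reduces frame redundancy before QA pairs are generated, but it does not spell out the mechanism. The sketch below is one plausible reading, not the authors' implementation: it assumes each frame already has a semantic embedding (for example from a captioner or a vision encoder) and keeps a frame only when its cosine similarity to the most recently kept frame falls below a threshold. The function names `cosine_similarity` and `drop_redundant_frames`, and the default threshold of 0.9, are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_frames(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> List[int]:
    """Return indices of frames to keep, in temporal order.

    A frame is kept when it is the first frame, or when its similarity to
    the most recently kept frame is below `threshold`.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine_similarity(embeddings[kept[-1]], emb) < threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Toy 2-D "embeddings": frames 1 and 3 nearly repeat frames 0 and 2.
    frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.0]]
    print(drop_redundant_frames(frames))  # [0, 2, 4]
```

Comparing each frame only against the last kept frame preserves temporal order and lets a scene that reappears later (frame 4 above) be kept again, which suits the abstract's emphasis on temporal coherence. A real pipeline would likely use learned embeddings and a tuned threshold.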
