The advancement of Large Vision Language Models (LVLMs) has significantly improved multimodal understanding, yet challenges remain in video reasoning tasks due to the scarcity of high-quality, large-scale datasets. Existing video question-answering (VideoQA) datasets often rely on costly manual annotations with insufficient granularity or automatic construction methods with redundant frame-by-frame analysis, limiting their scalability and effectiveness for complex reasoning. To address these challenges, we introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate reasoning steps. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We further develop video Chain-of-Thought (CoT) annotations to enrich reasoning processes, guiding GPT-4o in extracting logical relationships from QA pairs and video content. To exploit the potential of high-quality VideoQA pairs, we propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM. This framework adaptively selects core frames and performs CoT reasoning using multimodal evidence. Evaluated on our proposed benchmark with 14 tasks against 9 popular LVLMs, our method outperforms existing baselines on most tasks, demonstrating superior video reasoning capabilities. Our code and dataset will be released at: https://github.com/hshjerry/VideoEspresso
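The abstract does not say how the semantic-aware redundancy reduction is implemented. As a rough illustration only, one plausible reading is to embed each frame and keep a frame only when it differs enough from the last kept frame. The sketch below assumes per-frame embeddings are already available; the function names, the cosine-similarity criterion, and the `threshold` value are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_key_frames(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical value, not from the paper
) -> List[int]:
    """Return indices of frames to keep.

    A frame is kept when its similarity to the most recently kept frame
    falls below `threshold`; the first frame is always kept.
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if not kept or cosine_similarity(embeddings[kept[-1]], emb) < threshold:
            kept.append(i)
    return kept


# Toy 2-D "embeddings": frames 1 and 3 nearly repeat their predecessors.
frames = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(select_key_frames(frames))  # [0, 2, 4]
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents a slow drift across many near-identical frames from slipping past the filter. A real pipeline would likely obtain the embeddings from a vision or captioning model, which this sketch leaves out.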