We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at https://hourvideo.stanford.edu
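To make the evaluation setup concrete, below is a minimal sketch of scoring five-way multiple-choice predictions against the 20% random-chance baseline that the reported results are compared to. The question schema and field names here are hypothetical illustrations, not the actual HourVideo toolkit API.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQ:
    """Hypothetical five-way multiple-choice item (field names are
    illustrative, not the actual HourVideo toolkit schema)."""
    question: str
    options: list[str]   # exactly five candidate answers
    answer_idx: int      # index of the correct option, 0-4

def accuracy(questions: list[MCQ], predictions: list[int]) -> float:
    """Fraction of questions where the predicted option index is correct."""
    correct = sum(p == q.answer_idx for q, p in zip(questions, predictions))
    return correct / len(questions)

# Sanity check: a random guesser over five options converges to ~20%
# accuracy, the chance baseline that current multimodal models only
# marginally exceed on this benchmark.
if __name__ == "__main__":
    qs = [MCQ(f"q{i}", [f"opt{j}" for j in range(5)], random.randrange(5))
          for i in range(10_000)]
    guesses = [random.randrange(5) for _ in qs]
    print(f"random-chance accuracy: {accuracy(qs, guesses):.3f}")  # ~0.200
```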