EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

We introduce EgoSchema, a very long-form video question-answering dataset and benchmark for evaluating the long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human-curated multiple-choice question-answer pairs spanning over 250 hours of real video data and covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected from five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that the length of the video clip alone does not truly capture the temporal difficulty of the video task being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks and datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy of less than 33% (random is 20%) on the EgoSchema multiple-choice question answering task, while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io
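To make the reported numbers concrete, the following is a minimal sketch of how accuracy on a five-option multiple-choice task relates to the 20% random baseline. It is an illustration only, not the released evaluation code: the function names `chance_accuracy` and `accuracy`, the integer encoding of options (0 to 4), and the toy predictions are all assumptions made for this example.

```python
from typing import Sequence

# Each EgoSchema question offers five answer options (per the abstract).
NUM_OPTIONS = 5


def chance_accuracy(num_options: int = NUM_OPTIONS) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1.0 / num_options


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


# Toy, made-up data (hypothetical, not taken from the dataset):
# option indices in the range 0..4 for six questions.
answers = [0, 3, 1, 4, 2, 2]
predictions = [0, 1, 1, 4, 0, 3]

print(f"chance level: {chance_accuracy():.0%}")
print(f"model accuracy: {accuracy(predictions, answers):.0%}")
```

With five options, uniform guessing yields an expected accuracy of 1/5, which is the 20% baseline against which the reported sub-33% model accuracy and roughly 76% human accuracy should be read.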
