Large multimodal models (LMMs) are processing increasingly long and rich inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as accurately retrieving and reasoning over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, each question contains a referring query that references related video contexts, called the referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g., GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when models are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.