Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle-in-a-haystack test widely used to evaluate LLMs, we introduce a novel task, Needle in a Montage (NeMo), designed to assess VideoLLMs' critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that enables high-quality data synthesis. Built upon this pipeline, we present NeMoBench, a video-language benchmark centered on our task. The full NeMoBench comprises 31,378 automatically generated question-answer (QA) pairs from 13,486 videos, with durations ranging from seconds to hours. Experiments demonstrate that our pipeline reliably and automatically generates high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.