Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts and therefore under-test reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each question requires at least three non-overlapping cues drawn from distinct video segments. HERBench contains 26,806 five-way multiple-choice questions across 12 compositional tasks. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes higher evidential demand than prior benchmarks. Evaluating 13 state-of-the-art Video-LLMs yields only 31-42% accuracy, modestly above the 20% random-guess baseline for five-way questions. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. HERBench thus provides a principled benchmark for studying robust multi-evidence video understanding.
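The MRFS definition above (the smallest number of frames a model must fuse to answer correctly) can be illustrated with a minimal sketch. The abstract does not describe how MRFS is actually computed, so everything below is an assumption for illustration: the function name `minimum_required_frames`, the exhaustive subset search, and the toy oracles that stand in for a real Video-LLM are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, Tuple


def minimum_required_frames(
    frames: Sequence[int],
    answers_correctly: Callable[[Tuple[int, ...]], bool],
    max_size: Optional[int] = None,
) -> Optional[int]:
    """Return the size of the smallest frame subset that yields a correct answer.

    Hypothetical helper: `answers_correctly` stands in for running a model on
    the given frames and checking its answer. Subsets are tried in order of
    increasing size, so the first hit is the minimum. Returns None if no
    subset within `max_size` frames suffices.
    """
    limit = len(frames) if max_size is None else min(max_size, len(frames))
    for k in range(1, limit + 1):
        for subset in combinations(frames, k):
            if answers_correctly(subset):
                return k
    return None


# Toy frame indices for one video (illustrative values only).
frames = [3, 8, 17, 25, 42, 60]


def needs_three_cues(subset: Tuple[int, ...]) -> bool:
    # Toy oracle: the question is answerable only if frames 3, 17 and 42
    # (three cues from distinct segments) are all present.
    return {3, 17, 42} <= set(subset)


def needs_one_cue(subset: Tuple[int, ...]) -> bool:
    # Toy oracle for a single-cue shortcut: frame 8 alone is enough.
    return 8 in subset


print(minimum_required_frames(frames, needs_three_cues))  # 3
print(minimum_required_frames(frames, needs_one_cue))     # 1
```

The exhaustive search is exponential in the number of frames, so a practical measurement would need a cap such as `max_size` or an approximate search. The sketch is meant only to show the contrast between a question with a single-cue shortcut and one that needs three separate cues.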