Evaluating how well Multimodal Large Language Models (MLLMs) understand human-centric video remains a major challenge, as existing benchmarks often overlook the intricacies of emotion, behavior, and cross-modal alignment. We introduce HumanVBench, a comprehensive video benchmark designed to rigorously probe these capabilities across 16 fine-grained tasks. A cornerstone of our work is a novel and scalable benchmark construction methodology, featuring two automated pipelines that synthesize high-quality video annotations and challenging multiple-choice questions with minimal human labor. By leveraging state-of-the-art models for annotation and systematically converting model-induced errors into plausible distractors, our framework provides a generalizable ``machine'' for creating nuanced evaluation suites. Our extensive evaluation of 30 leading MLLMs on HumanVBench reveals critical deficiencies, particularly in perceiving subtle emotions and aligning speech with visual cues; even top proprietary models fall short of human performance. We open-source HumanVBench and our synthesis pipelines to catalyze the development of more socially intelligent and capable video MLLMs.