Understanding video inherently requires reasoning over both visual and auditory information. To properly evaluate Omni-Large Language Models (Omni-LLMs), which process multi-modal input including vision and audio, a benchmark must cover three key aspects: (1) multi-modal dependency (i.e., questions that cannot be answered from vision or audio alone), (2) diverse audio information types (e.g., speech, sound events), and (3) varying scene spans. Existing datasets fall short in one or more of these dimensions, which limits strict and comprehensive evaluation. To address this gap, we introduce JointAVBench, a novel benchmark with strict audio-video correlation that spans five cognitive dimensions, four audio information types (speech, sound events, music, vocal traits), and three scene spans (single-, cross-, and full-scene). Because manual annotation is costly, we propose an automated pipeline that uses state-of-the-art vision-LLMs, audio-LLMs, and general-purpose LLMs to synthesize questions and answers that strictly require joint audio-visual understanding. We evaluate leading vision-only, audio-only, and Omni-LLMs on our dataset. Even the best-performing Omni-LLM reaches an average accuracy of only 65.3%; it outperforms the uni-modal baselines but leaves substantial room for improvement, especially in cross-scene reasoning.
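The multi-modal dependency criterion stated above (a question should not be answerable from vision or audio alone) can be illustrated with a minimal sketch. This is not the paper's pipeline: the `ProbeResult` record, its field names, and the filtering rule are hypothetical, and the sketch assumes each question has already been probed under video-only, audio-only, and joint input conditions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """Hypothetical record: was a question answered correctly per input condition?"""
    question_id: str
    correct_video_only: bool
    correct_audio_only: bool
    correct_joint: bool


def requires_joint_understanding(result: ProbeResult) -> bool:
    """Assumed rule: solvable with both modalities, but with neither one alone."""
    return (
        result.correct_joint
        and not result.correct_video_only
        and not result.correct_audio_only
    )


def filter_joint_questions(results: List[ProbeResult]) -> List[str]:
    """Return the IDs of questions that pass the assumed dependency rule."""
    return [r.question_id for r in results if requires_joint_understanding(r)]
```

Under this assumed rule, a question that a vision-only or audio-only probe already answers correctly would be discarded, as would one that remains unanswerable even with both modalities.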