Recent advances in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations focus primarily on short audio and video clips ranging from 10 seconds to 5 minutes, and so fail to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. Its videos are high-quality, sourced from open platforms, and feature rich audio-visual dynamics. Built through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. It is designed to evaluate OmniLLMs across several dimensions, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs: open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems in long-form audio-visual contexts.