Frontier models have either been language-only or have primarily focused on vision and language modalities. Although models with vision and audio understanding capabilities have recently made substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality perception performance. We introduce MAVERIX~(Multimodal Audio-Visual Evaluation Reasoning IndeX), a novel benchmark with 700 videos and 2,556 questions explicitly designed to evaluate multimodal models through tasks that require close integration of video and audio information. MAVERIX uniquely provides models with audiovisual tasks, closely mimicking the multimodal perceptual experience available to humans during inference and decision-making. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration. Experiments with state-of-the-art models, including Gemini 1.5 Pro and o1, show performance approaching human levels (around 70% accuracy), while human experts reach near-ceiling performance (95.1%). With standardized evaluation protocols, a rigorous annotation pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence.