SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while the ability of these models to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark that systematically evaluates MLLM performance in real-world settings. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on three key tasks: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations: while the gap in MCQ accuracy between the two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best-performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding. We release SONIC-O1 for reproducibility and research.

Project page: https://vectorinstitute.github.io/sonic-o1/
Dataset: https://huggingface.co/datasets/vector-institute/sonic-o1
GitHub: https://github.com/vectorinstitute/sonic-o1
Leaderboard: https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard
