Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks that humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can genuinely understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To infer the correct answer, models must effectively leverage clues from both the visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we structure the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.