Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini 1.5 Pro, and Reka Core, have expanded their capabilities to include vision and audio modalities. While these models demonstrate impressive performance across a wide range of audio-visual applications, our proposed DeafTest reveals that MLLMs often struggle with simple tasks that humans find trivial: 1) determining which of two sounds is louder, and 2) determining which of two sounds has a higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether these MLLMs can genuinely understand audio-visual information. The benchmark comprises 4,555 carefully crafted problems, each incorporating text, visual, and audio components. To infer the correct answer, models must effectively leverage clues from both the visual and audio inputs. To ensure precise and objective evaluation of MLLM responses, we structure the questions as multiple-choice, eliminating the need for human evaluation or LLM-assisted assessment. We benchmark a series of closed-source and open-source models and summarize our observations. By revealing the limitations of current models, we aim to provide useful insights for future dataset collection and model development.