We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, a significant gap exists in comprehensive benchmarks for thoroughly evaluating their capabilities. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. In our study, we evaluated the capabilities of four models across various aspects and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model developments.