We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, comprehensive benchmarks for evaluating their capabilities are still lacking. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. We evaluate four models across various aspects and find that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model development.