We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, of which 7 are newly proposed. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice (paralinguistic) understanding. Despite recent advancements, a comprehensive benchmark for AudioLLMs' instruction-following capabilities conditioned on audio signals is still lacking. AudioBench addresses this gap by providing curated datasets together with suitable evaluation metrics. In addition, we evaluate five popular models and find that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model development.