Recently, instruction-following audio-language models have received broad attention for human-audio interaction. However, the absence of benchmarks capable of evaluating audio-centric interaction capabilities has impeded progress in this field. Previous benchmarks primarily focus on assessing distinct fundamental tasks, such as Automatic Speech Recognition (ASR), and lack an assessment of open-ended generative capabilities centered around audio. This makes it challenging to track progress in the domain of Large Audio-Language Models (LALMs) and to provide guidance for future improvement. In this paper, we introduce AIR-Bench (\textbf{A}udio \textbf{I}nst\textbf{R}uction \textbf{Bench}mark), the first benchmark designed to evaluate the ability of LALMs to understand various types of audio signals (including human speech, natural sounds, and music) and, furthermore, to interact with humans in textual form. AIR-Bench comprises two dimensions: \textit{foundation} and \textit{chat} benchmarks. The former consists of 19 tasks with approximately 19k single-choice questions, intended to inspect the basic single-task abilities of LALMs. The latter contains 2k instances of open-ended question-and-answer data, directly assessing a model's comprehension of complex audio and its capacity to follow instructions. Both benchmarks require the model to generate hypotheses directly. We design a unified framework that leverages advanced language models, such as GPT-4, to score the generated hypotheses given the meta-information of the audio. Experimental results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation. By revealing the limitations of existing LALMs through evaluation results, AIR-Bench can provide insights into the direction of future research.
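To make the scoring protocol concrete, below is a minimal Python sketch of the GPT-4-based judging step, assuming the OpenAI chat-completions API. The prompt wording, the 1--10 scale, and the \texttt{judge\_hypothesis} helper are illustrative assumptions, not the benchmark's exact implementation.

\begin{verbatim}
# A minimal sketch of GPT-4-as-judge scoring: the judge sees only the
# audio's textual meta-information, the question, and the model's
# hypothesis. Prompt and scale are assumed for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = """You are evaluating an audio-language model.
Audio meta-information: {meta}
Question: {question}
Model hypothesis: {hypothesis}
Rate how well the hypothesis answers the question given the
meta-information, on a scale of 1 to 10. Reply with the number only."""

def judge_hypothesis(meta: str, question: str, hypothesis: str) -> int:
    """Score one generated hypothesis against the audio meta-information."""
    prompt = JUDGE_TEMPLATE.format(
        meta=meta, question=question, hypothesis=hypothesis
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic judging
    )
    return int(resp.choices[0].message.content.strip())

# Example: scoring a hypothesis about a natural-sound clip.
score = judge_hypothesis(
    meta="A dog barks twice, then a car door slams.",
    question="What sounds occur in the clip, and in what order?",
    hypothesis="First a dog barks, then a car door closes.",
)
print(score)
\end{verbatim}

Judging from meta-information alone keeps the evaluator text-only, so a strong language model can score open-ended answers without itself needing to process audio.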