Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs to engage in back-and-forth audio dialogues with humans. This progression not only underscores the potential of LALMs but also broadens their applicability across a wide range of practical scenarios supported by audio dialogues. However, despite these advancements, a comprehensive benchmark for evaluating the performance of LALMs in open-ended audio dialogue understanding remains absent. To address this gap, we propose an Audio Dialogue Understanding Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the open-ended audio dialogue abilities of LALMs across 3 general scenarios, 12 skills, 9 languages, and 4 categories of ambiguity handling. Notably, we are the first to propose evaluating ambiguity handling in audio dialogues, where sentences with the same literal meaning express different intentions, e.g., "Really!?" uttered with different intonations. In summary, ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs. Through extensive experiments on 13 LALMs, our analysis reveals that there is still considerable room for improvement in the audio dialogue understanding abilities of existing LALMs. In particular, they struggle with mathematical symbols and formulas, understanding human behavior such as roleplay, comprehending multiple languages, and handling audio dialogue ambiguities arising from different phonetic elements, such as intonations, pause positions, and homophones.