Various audio-LLMs (ALLMs) have recently been explored for tackling different audio tasks simultaneously using a single, unified model. While existing evaluations of ALLMs primarily focus on single-audio tasks, real-world applications often involve processing multiple audio streams simultaneously. To bridge this gap, we propose the first multi-audio evaluation (MAE) benchmark, which consists of 20 datasets from 11 multi-audio tasks encompassing both speech and sound scenarios. Comprehensive experiments on MAE demonstrate that existing ALLMs, while powerful in comprehending the primary audio elements of individual audio inputs, struggle to handle multi-audio scenarios. To this end, we propose a novel multi-audio-LLM (MALLM) that captures the audio context among multiple similar audios through discriminative learning on our proposed synthetic data. The results demonstrate that the proposed MALLM outperforms all baselines and achieves high data efficiency using synthetic data without requiring human annotations. The proposed MALLM opens the door for ALLMs toward the multi-audio processing era and brings us closer to replicating human auditory capabilities in machines.