Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating strong capability in universal audio understanding through cross-modal integration. To evaluate the audio understanding performance of LALMs, researchers have proposed a variety of benchmarks. However, existing benchmarks leave key aspects of real-world interaction underexplored: audio signals typically contain both speech and non-speech components, and the energy levels of these components can vary significantly across scenarios. Moreover, most benchmarks do not consider the joint understanding of speech, scene, and events within the same audio clip. In this work, we introduce SSEU-Bench, the first versatile audio understanding benchmark that explicitly accounts for energy differences between speech and non-speech audio, with both independent and joint understanding settings for speech, scene, and events. Furthermore, we show that some LALMs tend to underperform on certain tasks in the joint understanding setting. To address this issue, we introduce Chain-of-Thought prompting, which effectively improves LALMs' joint audio understanding performance by decomposing complex tasks into simpler reasoning steps.
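To make the Chain-of-Thought idea concrete, the sketch below shows one plausible way to decompose the joint speech/scene/event question into ordered sub-steps before asking for a final answer. The abstract does not give the actual prompts, so the wording here and the `query_lalm` helper are illustrative assumptions, not the benchmark's implementation.

```python
# Illustrative sketch only: the prompt wording and the query_lalm helper are
# assumptions for exposition, not the paper's actual setup. The point is the
# Chain-of-Thought structure: the joint task is split into ordered sub-steps
# (speech, scene, events) before the final combined answer is requested.

def query_lalm(audio_path: str, prompt: str) -> str:
    """Placeholder for a call to any Large Audio Language Model.

    Replace the body with the inference call of the model under test
    (e.g., an HTTP request or a local model's generate() method).
    """
    raise NotImplementedError("plug in your LALM inference call here")

# Single-pass prompt: ask for everything at once (the setting where some
# LALMs were reported to underperform).
DIRECT_PROMPT = (
    "Listen to the audio clip and report the speech content, the acoustic "
    "scene, and the sound events in a single answer."
)

# Chain-of-Thought prompt: decompose the joint task into simpler steps.
COT_PROMPT = (
    "Listen to the audio clip and reason step by step.\n"
    "Step 1: Transcribe any speech you hear.\n"
    "Step 2: Identify the acoustic scene (e.g., street, office, park).\n"
    "Step 3: List the non-speech sound events.\n"
    "Step 4: Combine Steps 1-3 into a final answer with the fields "
    "'speech', 'scene', and 'events'."
)

def joint_understanding(audio_path: str, use_cot: bool = True) -> str:
    """Ask the model for speech, scene, and events in one pass."""
    prompt = COT_PROMPT if use_cot else DIRECT_PROMPT
    return query_lalm(audio_path, prompt)
```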