While Audio Large Language Models (ALLMs) have achieved remarkable proficiency, they remain brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, and thus fail to capture the intricate, multi-layered acoustic dynamics, or ``Acoustic Ecology'', that characterize authentic physical environments. To bridge this ecological gap, we introduce \textbf{RSA-Bench}, a comprehensive robustness benchmark designed to stress-test ALLMs through high-fidelity auditory scene simulations. Unlike prior evaluations, we construct evaluation samples by naturally superimposing diverse environmental soundscapes (spanning \textit{Pasture}, \textit{Extreme Weather}, \textit{Classroom}, and \textit{Outdoors}) onto clean speech signals at a range of interference intensities. By evaluating models on six core tasks ranging from fundamental perception to complex reasoning, our study reveals three macro-level findings: \textbf{(I) The Perception-Cognition Gap:} Models remain relatively resilient in low-level recognition but suffer a \textbf{functional collapse} in high-order reasoning tasks under stress; \textbf{(II) Scenario Sensitivity:} ``Vocal-like'' interference (e.g., background laughter) proves significantly more destructive than mechanical noise, challenging the models' auditory attention mechanisms; and \textbf{(III) The Denoising Paradox:} Standard speech enhancement often exacerbates performance degradation, as ALLMs prove highly sensitive to the semantic distortions introduced by denoising artifacts.