Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere with each other. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, which limits scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves the noise robustness of LALMs. Specifically, FTL first separates the input waveform into speech and non-speech components. A modality router then predicts the target audio modality (e.g., speech) from the user's instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning the LALMs.
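The three stages described above (separation, instruction-based routing, modality-aware fusion) can be illustrated with a minimal Python sketch. This is not the actual FTL implementation: the function names `separate`, `route`, `fuse`, and `focus_then_listen`, the keyword lists, and the `residual_gain` parameter are all hypothetical. The moving-average "separator" and keyword "router" are crude stand-ins for the learned components, used only so the data flow runs end to end.

```python
import numpy as np

# Hypothetical keyword lists; a real router would be a learned model.
SPEECH_KEYWORDS = ("transcribe", "speaker", "say", "said", "spoken")
SOUND_KEYWORDS = ("sound", "noise", "music", "background")


def separate(waveform):
    """Stand-in separator (assumption): a moving average plays the role of the
    'speech' estimate and the residual plays the role of 'non-speech'."""
    kernel = np.ones(5) / 5.0
    speech = np.convolve(waveform, kernel, mode="same")
    non_speech = waveform - speech
    return speech, non_speech


def route(instruction):
    """Toy router: pick the target modality from keywords in the instruction."""
    text = instruction.lower()
    wants_speech = any(k in text for k in SPEECH_KEYWORDS)
    wants_sound = any(k in text for k in SOUND_KEYWORDS)
    if wants_speech and not wants_sound:
        return "speech"
    if wants_sound and not wants_speech:
        return "non_speech"
    return "both"


def fuse(speech, non_speech, target, residual_gain=0.1):
    """Toy fusion: keep the target stream and attenuate the other one."""
    if target == "speech":
        return speech + residual_gain * non_speech
    if target == "non_speech":
        return non_speech + residual_gain * speech
    return speech + non_speech  # no preference: pass the mixture through


def focus_then_listen(waveform, instruction):
    """Run the three stages and return (enhanced_waveform, target_modality)."""
    speech, non_speech = separate(waveform)
    target = route(instruction)
    return fuse(speech, non_speech, target), target


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixture = rng.standard_normal(160)
    enhanced, target = focus_then_listen(mixture, "Transcribe what the speaker says.")
    print(target, enhanced.shape)
```

In this sketch the enhanced waveform would be passed to an unmodified LALM together with the original instruction, which reflects the plug-and-play property: only the audio input changes, not the model weights.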