Large language models (LLMs) have shown exceptional versatility in natural language processing, prompting recent efforts to extend their multimodal capabilities to speech processing through the development of audio large language models (Audio LLMs). While Audio LLMs excel in tasks such as speech recognition and synthesis, it remains unclear how they perform on the auditory cognitive challenges posed by real-world environments, such as audio comprehension and listening recall, particularly in the presence of background noise or overlapping speech. Unlike text-based LLMs, which have access to vast amounts of text data for pre-training, retraining Audio LLMs on diverse auditory cognitive scenes is difficult due to the scarcity of datasets that simulate real-world auditory cognitive scenarios and the challenge of acquiring auditory cognitive labels for training. While test-time compute (TTC) methods have been shown to enhance the capabilities of text-based LLMs during inference, a key challenge lies in designing TTC methods that improve the auditory capabilities of Audio LLMs. This study addresses these two research gaps by: i) exploring the auditory cognitive capabilities of Audio LLMs, and ii) enhancing these capabilities using TTC approaches. We investigate five different Audio LLMs for auditory cognition using a \textit{self-collected} database and propose five TTC approaches to enhance auditory cognitive capabilities during inference. Our findings reveal that the performance of Audio LLMs degrades on more challenging auditory cognitive tasks. The proposed TTC approaches significantly enhance auditory cognitive capabilities, advancing the development of more adaptable and resilient Audio LLMs for practical applications such as assistive listening devices, voice-based AI assistants, and communication technologies.