Large Audio Language Models (LALMs) have extended human-machine interaction to the speech modality, which carries great interactive potential because paralinguistic cues implicitly convey the user's context. However, under the current content-centred paradigm, LALMs usually neglect such paralinguistic cues and respond solely based on query content. In this work, to resurface paralinguistic awareness in LALMs, we introduce five diverse layer-wise analyses that jointly identify paralinguistic layers and semantic-understanding layers. Based on these insights, we propose a paralinguistic-enhanced fine-tuning (PE-FT) protocol that equips LALMs with paralinguistic-aware capabilities through (1) selective-layer fine-tuning and (2) an auxiliary dual-level classification head. Our experiments demonstrate that the PE-FT protocol efficiently and effectively resurfaces paralinguistic awareness, even surpassing the performance of all-layer fine-tuning.
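The two PE-FT components can be illustrated with a minimal sketch. The layer indices, backbone depth, and label sets below are hypothetical placeholders for illustration, not the configuration used in this work; "dual-level" is read here as a coarse head plus a fine-grained head, which is one plausible interpretation.

```python
# Hypothetical sketch of selective-layer fine-tuning: only the layers
# identified as paralinguistic by the layer-wise analyses are marked
# trainable, while the rest of the backbone stays frozen.
PARALINGUISTIC_LAYERS = {4, 5, 6}   # assumed layer indices, for illustration
NUM_LAYERS = 12                     # assumed backbone depth

def selective_trainable_mask(num_layers, selected):
    """Return per-layer trainable flags for selective-layer fine-tuning."""
    return {i: i in selected for i in range(num_layers)}

mask = selective_trainable_mask(NUM_LAYERS, PARALINGUISTIC_LAYERS)

# The auxiliary dual-level classification head is modelled as two
# classifiers over assumed label sets: a coarse level (is the utterance
# expressive at all?) and a fine level (which paralinguistic category?).
COARSE_CLASSES = ["neutral", "expressive"]
FINE_CLASSES = ["happy", "sad", "angry", "surprised"]

def dual_level_targets(coarse_label, fine_label):
    """Map string labels to the pair of class indices the two heads predict."""
    return COARSE_CLASSES.index(coarse_label), FINE_CLASSES.index(fine_label)
```

In a real training loop, the mask would control each layer's `requires_grad` flags, and the two heads would each contribute a classification loss alongside the language-modelling objective.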