Connecting audio encoders with large language models (LLMs) allows the LLM to perform various audio understanding tasks, such as automatic speech recognition (ASR) and audio captioning (AC). Most research focuses on training an adapter layer to generate a single, unified audio feature for the LLM. However, different tasks may require distinct features that emphasize either semantic or acoustic aspects, which makes task-specific audio features more desirable. In this paper, we propose Prompt-aware Mixture (PaM) to enhance Speech LLMs that use multiple audio encoders. Our approach uses different experts to extract different features, conditioned on the prompt, which indicates the task. Experiments demonstrate that with PaM, a single Speech LLM surpasses the best performance achieved by any of the single-encoder Speech LLMs on ASR, Speaker Number Verification, and AC tasks. PaM also outperforms other feature fusion baselines, such as concatenation and averaging.
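To make the idea of prompt-conditioned feature extraction concrete, the following is a minimal sketch and not the architecture evaluated in the paper. The abstract does not specify how experts are built or how the prompt selects among them, so everything here is an assumption made for illustration: the function name `prompt_aware_mixture`, the use of linear experts over concatenated encoder outputs, the softmax gate computed from a pooled prompt embedding, and all shapes.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def prompt_aware_mixture(encoder_feats, prompt_emb, gate_w, expert_ws):
    """Fuse features from several audio encoders with prompt-dependent weights.

    Hypothetical sketch; all names and shapes are assumptions.
      encoder_feats: list of (T, d_enc) arrays, one per encoder, time-aligned
      prompt_emb:    (d_prompt,) pooled embedding of the task prompt
      gate_w:        (d_prompt, n_experts) gating projection
      expert_ws:     list of n_experts arrays, each (n_enc * d_enc, d_out)
    Returns the fused (T, d_out) feature and the (n_experts,) gate weights.
    """
    # Stack the encoder outputs along the feature axis.
    x = np.concatenate(encoder_feats, axis=-1)
    # The gate depends only on the prompt, so each task gets its own mixture.
    gate = softmax(prompt_emb @ gate_w)
    # Each (linear) expert maps the stacked features to the LLM input width.
    expert_out = np.stack([x @ w for w in expert_ws])  # (n_experts, T, d_out)
    # Weighted sum of expert outputs over the expert axis.
    fused = np.tensordot(gate, expert_out, axes=1)  # (T, d_out)
    return fused, gate
```

In this sketch, concatenation and averaging correspond to fusion rules that are fixed across tasks, whereas the gate lets the mixture change with the prompt. Calling the function with two different prompt embeddings (for example, one for an ASR prompt and one for a captioning prompt) yields two different gate vectors and therefore two different fused features for the same audio.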