In this paper, we introduce a speech-conditioned Large Language Model (LLM) integrated with a Mixture of Experts (MoE) based connector to address the challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR). Specifically, we propose an Insertion and Deletion of Interruption Token (IDIT) mechanism to better transfer the text generation ability of the LLM to the speech recognition task. We also present a connector with an MoE architecture that manages multiple languages efficiently. To further enhance the collaboration of multiple experts and leverage the understanding capabilities of the LLM, we propose a two-stage progressive training strategy: 1) The connector is unfrozen and trained with language-specialized experts to map speech representations to the text space. 2) The connector and the LLM's LoRA adapter are trained with the proposed IDIT mechanism, and all experts are activated to learn general representations. Experimental results demonstrate that our method significantly outperforms state-of-the-art models, including end-to-end and large-scale audio-language models.
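To make the two-stage routing idea concrete, the following is a minimal, hypothetical sketch of an MoE connector: each expert is a linear projection from the speech-encoder space to the LLM text-embedding space, stage 1 routes each frame to a single language-specialized expert (hard gating), and stage 2 activates all experts via softmax gating. All names (`SPEECH_DIM`, `TEXT_DIM`, `route`) and the use of plain NumPy are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch only: a linear-expert MoE connector with two gating modes.
rng = np.random.default_rng(0)
SPEECH_DIM, TEXT_DIM, N_EXPERTS = 8, 16, 2  # e.g. one expert per language

# Each expert maps speech representations into the LLM text space.
experts = [rng.normal(size=(SPEECH_DIM, TEXT_DIM)) for _ in range(N_EXPERTS)]
gate_w = rng.normal(size=(SPEECH_DIM, N_EXPERTS))  # gating network (linear)

def route(x, stage):
    """Stage 1: hard routing to one language-specialized expert per frame.
    Stage 2: all experts active, mixed by softmax gate weights."""
    logits = x @ gate_w                      # (batch, N_EXPERTS)
    if stage == 1:
        # one-hot gate: only the top-scoring expert contributes
        w = np.eye(N_EXPERTS)[np.argmax(logits, axis=-1)]
    else:
        # softmax gate: every expert contributes, weighted by the gate
        e = np.exp(logits - logits.max(axis=-1, keepdims=True))
        w = e / e.sum(axis=-1, keepdims=True)
    outs = np.stack([x @ E for E in experts], axis=1)  # (batch, N_EXPERTS, TEXT_DIM)
    return (w[..., None] * outs).sum(axis=1)           # (batch, TEXT_DIM)

x = rng.normal(size=(4, SPEECH_DIM))  # a batch of 4 speech frames
y1 = route(x, stage=1)  # hard routing (stage 1)
y2 = route(x, stage=2)  # soft mixture over all experts (stage 2)
```

In stage 1 only the selected expert's gradient path is exercised, which encourages language specialization; in stage 2 the soft mixture lets experts collaborate on code-switched input.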