Recent advancements in integrating large language models (LLMs) with automatic speech recognition (ASR) have achieved remarkable performance in general domains. While supervised fine-tuning (SFT) of all model parameters is often employed to adapt pre-trained LLM-based ASR models to specific domains, it imposes high computational costs and notably degrades their performance in general domains. In this paper, we propose \textit{HDMoLE}, a novel parameter-efficient multi-domain fine-tuning method that adapts pre-trained LLM-based ASR models to multi-accent domains without catastrophic forgetting. HDMoLE combines low-rank adaptation (LoRA) with a mixture of experts (MoE) through hierarchical routing and dynamic thresholds, and it can be generalized to any linear layer. Hierarchical routing establishes a clear correspondence between LoRA experts and accent domains, improving cross-domain collaboration among the LoRA experts. Unlike the static Top-K strategy for activating LoRA experts, dynamic thresholds adaptively activate varying numbers of LoRA experts at each MoE layer. Experiments on multi-accent and standard Mandarin datasets demonstrate the efficacy of HDMoLE. Applying HDMoLE to the projector module of an LLM-based ASR model achieves performance comparable to full fine-tuning in the target multi-accent domains while using only 9.6% of the trainable parameters required for full fine-tuning, with minimal degradation in the source general domain.
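To make the gating mechanism concrete, below is a minimal PyTorch sketch of a linear layer augmented with multiple LoRA experts whose outputs are combined by threshold-based gating rather than Top-K selection. This is an illustrative assumption, not the authors' implementation: the class name \texttt{MoLELinear} and all hyperparameters are hypothetical, a fixed scalar \texttt{threshold} stands in for HDMoLE's dynamic per-layer thresholds, and hierarchical routing is omitted.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoLELinear(nn.Module):
    """Sketch only: a frozen linear layer plus several LoRA experts,
    where a router activates every expert whose gate score exceeds a
    threshold, instead of a fixed Top-K subset."""

    def __init__(self, in_features, out_features,
                 num_experts=4, rank=8, threshold=0.2):
        super().__init__()
        # Frozen pre-trained base projection.
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # One low-rank (A, B) pair per LoRA expert; B starts at zero
        # so the initial LoRA contribution is zero (standard LoRA init).
        self.lora_A = nn.Parameter(
            torch.randn(num_experts, in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(
            torch.zeros(num_experts, rank, out_features))
        self.router = nn.Linear(in_features, num_experts)  # gating network
        self.threshold = threshold  # stand-in for a dynamic threshold

    def forward(self, x):
        # x: (batch, in_features)
        gates = F.softmax(self.router(x), dim=-1)   # (batch, num_experts)
        mask = (gates >= self.threshold).float()    # variable expert count
        gates = gates * mask
        # Renormalize surviving gates; clamp avoids division by zero
        # when no expert clears the threshold (LoRA output is then zero).
        gates = gates / gates.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        # Weighted sum of per-expert low-rank updates x @ A_e @ B_e.
        lora_out = torch.einsum(
            'bi,eir,ero,be->bo', x, self.lora_A, self.lora_B, gates)
        return self.base(x) + lora_out
\end{verbatim}

In this sketch, renormalizing the surviving gate scores keeps the combined LoRA update on a comparable scale regardless of how many experts clear the threshold, which is one way a threshold gate can activate a different number of experts per layer than a static Top-K rule.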