A Language-Agnostic Hierarchical LoRA-MoE Architecture for CTC-based Multilingual ASR

Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, which limits their deployment on resource-constrained edge devices. In this study, we propose a lightweight, language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LoRA routing driven by language identification (LID) posteriors. The hierarchical design consists of a multilingual shared LoRA that learns language-invariant acoustic representations and language-specific LoRA experts that model language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving truly language-agnostic decoding. Experiments on the MSR-86K and MLC-SLM 2025 Challenge datasets demonstrate that HLoRA, using only single-pass decoding, achieves performance competitive with state-of-the-art two-stage inference methods, significantly improving decoding efficiency for low-resource mASR applications.
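
The hierarchical design and the LID-posterior-driven routing described above can be illustrated with a minimal NumPy sketch of a single adapted linear layer. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which layers carry LoRA modules, how the LID posterior is computed, or whether routing is soft or hard. The sketch assumes an utterance-level posterior from a mean-pooled linear LID head and a soft, posterior-weighted mixture of experts. The names `HierarchicalLoRALinear`, `lid_posterior`, `rank`, and `alpha` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def lid_posterior(frames, w_lid):
    """Assumed LID head: mean-pool frames (T, d), project to language logits, softmax."""
    return softmax(frames.mean(axis=0) @ w_lid)


class HierarchicalLoRALinear:
    """Frozen linear layer with one shared LoRA and one LoRA expert per language."""

    def __init__(self, d_in, d_out, num_langs, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight (stands in for a pretrained projection).
        self.W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        self.scale = alpha / rank
        # Shared LoRA: low-rank update used for every language.
        self.A_shared = rng.standard_normal((d_in, rank)) * 0.01
        self.B_shared = np.zeros((rank, d_out))
        # Language-specific LoRA experts, stacked along the first axis.
        self.A_lang = rng.standard_normal((num_langs, d_in, rank)) * 0.01
        self.B_lang = np.zeros((num_langs, rank, d_out))

    def forward(self, x, posterior):
        """x: (T, d_in) frames; posterior: (num_langs,) LID probabilities."""
        base = x @ self.W
        shared = (x @ self.A_shared) @ self.B_shared
        # Output of every expert, shape (num_langs, T, d_out).
        experts = np.einsum("td,ldr,lro->lto", x, self.A_lang, self.B_lang)
        # Soft routing: weight each expert by its language posterior.
        routed = np.einsum("l,lto->to", posterior, experts)
        return base + self.scale * (shared + routed)
```

With a one-hot posterior, the layer reduces to the base projection plus the shared LoRA and a single language expert, which corresponds to decoding with a known language label. With a soft posterior, no label is needed and the whole computation runs in one pass.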
