Large-scale multilingual ASR (mASR) models such as Whisper achieve strong performance but incur high computational and latency costs, limiting their deployment on resource-constrained edge devices. In this study, we propose a lightweight, language-agnostic multilingual ASR system based on a CTC architecture with domain adaptation. Specifically, we introduce a Language-agnostic Hierarchical LoRA-MoE (HLoRA) framework integrated into an mHuBERT-CTC model, enabling end-to-end decoding via LID-posterior-driven LoRA routing. The hierarchical design consists of a multilingual shared LoRA that learns language-invariant acoustic representations and language-specific LoRA experts that model language-dependent characteristics. The proposed routing mechanism removes the need for prior language identity information or explicit language labels during inference, achieving truly language-agnostic decoding. Experiments on the MSR-86K and MLC-SLM 2025 Challenge datasets demonstrate that HLoRA achieves performance comparable to two-stage inference approaches while reducing the real-time factor (RTF) by 11.7% and 8.2%, respectively, improving decoding efficiency for low-resource mASR applications.
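The routing idea in the abstract can be sketched as follows: a frozen base projection is augmented by an always-on shared LoRA plus a sum of language-specific LoRA experts weighted by the LID posterior, so no hard language label is needed at inference. This is a minimal illustrative sketch, not the paper's implementation; the shapes, rank, soft (posterior-weighted) routing, and all function names are assumptions.

```python
import numpy as np

def lora_delta(x, A, B):
    # Low-rank update x @ A @ B with rank r << d (standard LoRA form).
    return x @ A @ B

def hlora_forward(x, W, shared, experts, lid_posterior):
    """Hypothetical HLoRA-style layer.
    x: (T, d) frame features; W: (d, d) frozen base weight.
    shared: (A_s, B_s) language-invariant LoRA, always applied.
    experts: list of (A_i, B_i), one LoRA expert per language.
    lid_posterior: (n_lang,) probabilities from an LID head;
    experts are mixed by these posteriors (assumed soft routing)."""
    y = x @ W + lora_delta(x, *shared)           # base + shared LoRA
    for p, (A, B) in zip(lid_posterior, experts):
        y = y + p * lora_delta(x, A, B)          # posterior-weighted expert
    return y

# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
d, r, n_lang, T = 8, 2, 3, 4
x = rng.standard_normal((T, d))
W = rng.standard_normal((d, d))
shared = (rng.standard_normal((d, r)), rng.standard_normal((r, d)))
experts = [(rng.standard_normal((d, r)), rng.standard_normal((r, d)))
           for _ in range(n_lang)]
post = np.array([0.7, 0.2, 0.1])                 # example LID posterior
y = hlora_forward(x, W, shared, experts, post)
print(y.shape)
```

Because the experts enter as a posterior-weighted sum rather than a hard argmax, decoding needs no explicit language decision, which is what allows the single-pass (rather than two-stage LID-then-ASR) inference the abstract credits for the RTF reduction.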