Accented speech remains a persistent challenge for automatic speech recognition (ASR): most models are trained on data dominated by a few high-resource English varieties, so performance degrades substantially on other accents. Accent-agnostic approaches improve robustness but struggle with heavily accented or unseen varieties, while accent-specific methods depend on limited and often noisy accent labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns and gradually transitions to label-free routing, so inference requires no accent labels. Each expert carries its own CTC head to align routing decisions with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the Mcv-Accent benchmark show consistent gains on both seen and unseen accents under low- and high-resource conditions, with up to a 29.3% relative WER reduction over strong FastConformer baselines.
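The transition from accent-aware to label-free routing can be pictured as a convex blend between a label-derived one-hot expert assignment and a learned gating distribution, annealed over training. The sketch below is purely illustrative (the function name, the linear blend, and the annealing coefficient `alpha` are assumptions, not details from the paper); it only shows how routing can depend on accent labels early in training yet need none at inference.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over expert logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

def route(gate_logits, accent_onehot, alpha):
    """Blend accent-aware routing with learned label-free gating.

    alpha anneals from 1.0 (pure accent-label routing) toward 0.0
    during training, so inference (alpha = 0) uses only the learned
    gate and requires no accent labels. Illustrative sketch only.
    """
    gate_probs = softmax(gate_logits)
    # Convex combination of two distributions is itself a distribution.
    return alpha * accent_onehot + (1.0 - alpha) * gate_probs

# Example: 3 experts, utterance labeled with accent -> expert 0.
logits = np.array([0.0, 1.0, 2.0])
onehot = np.array([1.0, 0.0, 0.0])
early = route(logits, onehot, alpha=1.0)   # label-driven routing
late = route(logits, onehot, alpha=0.0)    # label-free routing
```

With `alpha=1.0` the routing ignores the gate and follows the accent label; with `alpha=0.0` it is exactly the gate's softmax, which is what inference would use.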