While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.
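As an illustration of the entropy-aware routing idea described above, the following is a minimal sketch and not the paper's implementation. It assumes that the domain classifier emits one logit per domain, that the routing uncertainty is measured by the normalized entropy of the resulting softmax distribution, and that this value is used directly as the mixing weight for the shared expert. The function names (`softmax`, `normalized_entropy`, `entropy_aware_mix`) and the linear mixing rule are hypothetical, and each expert output (which in the proposed framework would come from a projector or LoRA branch) is represented here by a plain vector.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of router logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def normalized_entropy(probs, eps=1e-12):
    """Shannon entropy of `probs`, scaled to [0, 1] by dividing by log(K)."""
    h = -np.sum(probs * np.log(probs + eps))
    return float(np.clip(h / np.log(len(probs)), 0.0, 1.0))


def entropy_aware_mix(router_logits, expert_outputs, shared_output):
    """Blend the top-1 domain expert with a shared expert.

    Assumption (not taken from the paper): the shared expert's weight
    equals the normalized routing entropy, so a confident router relies
    almost entirely on the selected domain expert, while an uncertain
    router leans on the shared expert.
    """
    probs = softmax(np.asarray(router_logits, dtype=float))
    alpha = normalized_entropy(probs)   # routing uncertainty in [0, 1]
    top = int(np.argmax(probs))         # index of the selected domain expert
    mixed = (1.0 - alpha) * expert_outputs[top] + alpha * shared_output
    return mixed, alpha, top


if __name__ == "__main__":
    # Three hypothetical domain experts and one shared expert (toy vectors).
    experts = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
    shared = np.array([5.0, 5.0])

    # Confident routing: output stays close to the first domain expert.
    print(entropy_aware_mix([20.0, 0.0, 0.0], experts, shared))

    # Ambiguous routing near a domain boundary: output moves to the shared expert.
    print(entropy_aware_mix([0.0, 0.0, 0.0], experts, shared))
```

Under these assumptions, an utterance that the classifier assigns confidently to one domain is handled almost entirely by that domain's expert, whereas an utterance near a domain boundary draws mostly on the shared expert. A thresholded or learned gate would be an equally plausible reading of the abstract.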