We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack, without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert pools for speech and text with hard routing and top-1 selection, embedded in hybrid-causality Conformer blocks (bidirectional attention for speech, causal attention for text). Training combines a CTC loss on speech positions with label-smoothed cross-entropy for text generation. Our 113M-parameter model consistently improves WER over a 139M-parameter AED baseline on LibriSpeech (2.8% vs. 3.2% on test-clean; 5.6% vs. 6.0% on test-other). On Common Voice 16.1, a single multilingual model covering five languages reduces average WER from 12.2% to 10.6%. To our knowledge, this is the first randomly initialized decoder-only ASR model to surpass strong AED baselines via modality-aware routing and sparse MoE, achieving better accuracy with fewer active parameters and without alignment or adaptation modules.
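The modality-aware routing described above can be illustrated with a minimal sketch: each token is hard-routed to the expert pool of its modality, and a per-pool gate then selects exactly one (top-1) expert. All names, shapes, and the use of plain linear experts here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of modality-aware sparse MoE with hard routing and
# top-1 expert selection. Shapes and expert form (a single linear map)
# are assumptions for illustration only.
rng = np.random.default_rng(0)
d_model, n_experts = 8, 4

# Disjoint expert pools: separate expert weights per modality.
pools = {
    "speech": [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)],
    "text":   [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)],
}
# One gate per pool scores that pool's experts for each token.
gates = {m: rng.standard_normal((d_model, n_experts)) for m in pools}

def moe_layer(x, modality):
    """Hard-route tokens to their modality's pool, then pick the top-1 expert."""
    experts, gate = pools[modality], gates[modality]
    logits = x @ gate                 # (n_tokens, n_experts) gating scores
    chosen = logits.argmax(axis=-1)   # top-1: one expert index per token
    out = np.empty_like(x)
    for i, e in enumerate(chosen):
        out[i] = x[i] @ experts[e]    # only the selected expert is evaluated
    return out, chosen

speech_tokens = rng.standard_normal((5, d_model))
text_tokens = rng.standard_normal((3, d_model))
y_speech, route_speech = moe_layer(speech_tokens, "speech")
y_text, route_text = moe_layer(text_tokens, "text")
```

Because routing is hard and top-1, only one expert's parameters are active per token, which is what keeps the active parameter count below that of a comparable dense model.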