Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) architectures, which learn to activate only a subset of parameters during training and inference, have been proposed as an energy-efficient path to even larger and more capable language models. This shift toward a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works incorporating MoE into ASR models involve complex designs, such as routing frames via a supplementary embedding network, improving the multilingual ability of the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We find that such delicate designs are not necessary: an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. More specifically, we benchmark our proposed model on a large-scale inner-source dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we support both streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
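The substitution described above can be sketched minimally: each FFN becomes a small pool of expert networks with a learned router that picks the top-k experts per frame and mixes their outputs by the renormalized gate probabilities. The code below is an illustrative toy in plain Python, not the paper's implementation; the dimensions, the random linear "experts", and all names are assumptions for demonstration only.

```python
import math
import random

random.seed(0)

DIM, NUM_EXPERTS, TOP_K = 4, 8, 2  # toy sizes, chosen for illustration

# Each "expert" is a stand-in for an FFN: here just a random linear map.
experts = [
    [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(DIM)]
    for _ in range(NUM_EXPERTS)
]
# Router: one weight vector per expert, yielding one logit per frame.
router = [[random.gauss(0, 0.1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def moe_ffn(x):
    """Route one frame x to its top-k experts and mix their outputs."""
    logits = [sum(r * xi for r, xi in zip(router[e], x)) for e in range(NUM_EXPERTS)]
    probs = softmax(logits)
    # Select the TOP_K highest-probability experts for this frame.
    topk = sorted(range(NUM_EXPERTS), key=lambda e: -probs[e])[:TOP_K]
    norm = sum(probs[e] for e in topk)  # renormalize over selected experts
    out = [0.0] * DIM
    for e in topk:
        y = matvec(experts[e], x)
        w = probs[e] / norm
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

frame = [0.5, -1.0, 0.25, 0.0]
print(moe_ffn(frame))
```

Because only TOP_K of NUM_EXPERTS experts run per frame, compute per token stays near the dense baseline while total parameters grow with the expert count, which is the mechanism behind the Dense-225M-level RTF at MoE-1B scale.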