This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving a variety of applications. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages over Whisper: improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, a 90% reduction in hallucination rate on ambient noise, and significantly improved timestamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of full-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
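For readers unfamiliar with the headline metric, the following is a minimal sketch of the standard WER definition: the word-level edit distance (substitutions, deletions, insertions) between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and whitespace tokenization is an assumption made for illustration; the text normalization used in our evaluation is not shown here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference, which is one way hallucinated output shows up in this metric.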