In the era of large models, autoregressive decoding often makes latency a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) with the PaLM 2 language model in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size, ranging from 128M to 340B parameters, on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems.
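To make the idea of scoring-mode fusion concrete, the sketch below rescores a fixed list of ASR hypotheses for one segment with a language model that only scores complete strings and never generates tokens. It is an illustration under stated assumptions, not the system described here: the log-linear combination, the `lm_weight` value, the `Hypothesis` type, the toy LM table, and all numbers are hypothetical, since the abstract does not specify the fusion formula. The helper `relative_wer_improvement` shows how a relative WER improvement is conventionally computed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    asr_logprob: float  # log-probability from the ASR model (toy value)


def fuse_and_pick(
    hyps: List[Hypothesis],
    lm_score: Callable[[str], float],
    lm_weight: float = 0.3,  # hypothetical interpolation weight
) -> Hypothesis:
    """Pick the hypothesis with the best fused score for one segment.

    Each hypothesis is scored independently of the others, so the LM
    calls could be batched and run in parallel on an accelerator.
    """
    fused = [h.asr_logprob + lm_weight * lm_score(h.text) for h in hyps]
    best = max(range(len(hyps)), key=lambda i: fused[i])
    return hyps[best]


def relative_wer_improvement(baseline_wer: float, fused_wer: float) -> float:
    """Relative improvement: (baseline - fused) / baseline."""
    return (baseline_wer - fused_wer) / baseline_wer


# Toy stand-in for an LLM in scoring mode: maps a full string to a log-probability.
TOY_LM = {"recognize speech": -3.0, "wreck a nice beach": -9.0}


def toy_lm_score(text: str) -> float:
    return TOY_LM[text]


segment_hyps = [
    Hypothesis("wreck a nice beach", -4.0),
    Hypothesis("recognize speech", -4.2),
]
best = fuse_and_pick(segment_hyps, toy_lm_score)
```

With these made-up scores, the ASR model alone slightly prefers the first hypothesis, while the fused score prefers the second because the LM term outweighs the small acoustic gap. A real system would replace `toy_lm_score` with a batched forward pass of the LLM over all hypotheses for the segment.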