The accuracy of end-to-end (E2E) automatic speech recognition (ASR) models continues to improve as they are scaled to larger sizes, with some now reaching billions of parameters. Widespread deployment and adoption of these models, however, require computationally efficient strategies for decoding. In the present work, we study one such strategy: applying multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames. While similar techniques have been investigated in previous work, we achieve dramatically more reduction than has previously been demonstrated through the use of multiple funnel reduction layers. Through ablations, we study the impact of various architectural choices in the encoder to identify the most effective strategies. We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech without significantly affecting word error rate on a large-scale voice search task, while improving encoder and decoder latencies by 48% and 92%, respectively, relative to a strong but computationally expensive baseline.
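To make the frame-rate arithmetic concrete, the following is a minimal sketch of how stacked reduction layers compound along the time axis. It is not the architecture studied here: the abstract does not state the input frame period, the number of reduction layers, or their strides, so the 10 ms input period and the eight stride-2 stages below are hypothetical values chosen only because they multiply out to 2.56 sec. Plain average pooling stands in for a funnel reduction layer, and the function names `funnel_pool`, `reduce_frames`, and `output_frame_period_ms` are illustrative.

```python
def funnel_pool(frames, stride=2):
    """Average-pool consecutive groups of `stride` frames along time.

    `frames` is a list of feature vectors (lists of floats). A trailing
    partial group is averaged over the frames it actually contains.
    Simplification: a stand-in for a funnel reduction layer.
    """
    pooled = []
    for start in range(0, len(frames), stride):
        group = frames[start:start + stride]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def reduce_frames(frames, strides):
    """Apply one pooling stage per entry in `strides`, in order."""
    for stride in strides:
        frames = funnel_pool(frames, stride)
    return frames


def output_frame_period_ms(input_period_ms, strides):
    """Time span covered by one output frame after all reduction stages."""
    period = input_period_ms
    for stride in strides:
        period *= stride
    return period


# Hypothetical configuration: 10 ms input frames, eight stride-2 stages.
ASSUMED_INPUT_PERIOD_MS = 10
ASSUMED_STRIDES = [2] * 8

# 1024 one-dimensional frames, i.e. 10.24 sec of audio under the assumption above.
example_frames = [[float(i)] for i in range(1024)]
reduced = reduce_frames(example_frames, ASSUMED_STRIDES)

print(output_frame_period_ms(ASSUMED_INPUT_PERIOD_MS, ASSUMED_STRIDES))  # 2560
print(len(reduced))  # 4
```

Under these assumed values, each output frame summarizes 256 input frames, so a decoder that runs once per encoder output frame is invoked 256 times less often than one that runs on every input frame.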