Conformer models maintain a large number of internal states, the vast majority of which are associated with self-attention layers. With limited memory bandwidth, reading these states from memory at each inference step can slow down inference. In this paper, we design an optimized conformer that is small enough to meet on-device restrictions and runs inference quickly on TPUs. We explore various ideas to improve execution speed, including replacing the lower conformer blocks with convolution-only blocks, strategically downsizing the architecture, and utilizing an RNNAttention-Performer. Our optimized conformer can be readily incorporated into a cascaded-encoder setting, allowing a second-pass decoder to operate on its output and improve accuracy whenever more resources are available. Altogether, we find that these optimizations can reduce latency by a factor of 6.8 at a reasonable trade-off in quality. With the cascaded second pass, we show that the recognition accuracy is completely recoverable. Thus, our proposed encoder can serve both as a strong standalone on-device encoder and as the first part of a high-performance ASR pipeline.
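The abstract does not spell out how the RNNAttention-Performer works, so the following is only a minimal sketch of the general idea behind linear-attention variants such as Performer: causal attention can be rewritten as a recurrence whose state has a fixed size, so a streaming model need not re-read a growing set of attention states at every step. Everything here is an assumption made for illustration. The function names are hypothetical, and a simple `elu(x) + 1` feature map stands in for Performer's random-feature approximation of softmax; it is not the authors' implementation.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1 (assumption: a stand-in for
    # Performer's random features, chosen to keep the sketch short).
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_quadratic(q, k, v):
    # Reference form: materializes the full T x T causal score matrix.
    qf, kf = feature_map(q), feature_map(k)
    scores = np.tril(qf @ kf.T)  # keep only current and past frames
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_linear_attention_recurrent(q, k, v):
    # Streaming form: the state (S, z) has a fixed size that does not
    # depend on how many frames have been processed so far.
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))  # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d_k)         # running sum of phi(k_t)
    out = np.empty((T, d_v))
    for t in range(T):
        qf, kf = feature_map(q[t]), feature_map(k[t])
        S += np.outer(kf, v[t])
        z += kf
        out[t] = (qf @ S) / (qf @ z)
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
# Both forms compute the same causal linear attention.
assert np.allclose(causal_linear_attention_quadratic(q, k, v),
                   causal_linear_attention_recurrent(q, k, v))
```

In the recurrent form, each step touches only `S` and `z`, whose combined size is `d_k * d_v + d_k` regardless of sequence length, which is the property that matters when memory bandwidth is the bottleneck.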