Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. In this work, we propose a carefully redesigned Conformer with a new downsampling scheme. The proposed model, named Fast Conformer, is 2.8x faster than the original Conformer while preserving state-of-the-art accuracy on Automatic Speech Recognition benchmarks. We also replace the original Conformer's global attention with limited-context attention after training, which enables transcription of hour-long audio. We further improve long-form speech transcription by adding a global token. Combined with a Transformer decoder, Fast Conformer also outperforms the original Conformer in both accuracy and speed on Speech Translation and Spoken Language Understanding.
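To illustrate how limited-context attention with a global token restricts which positions may attend to each other, a minimal sketch of the attention mask is given below. It is not the authors' implementation: the function names, the symmetric window, the window size, and the choice of position 0 as the global token are all assumptions made for illustration.

```python
def limited_context_mask(seq_len, window, use_global_token=True):
    """Build a seq_len x seq_len boolean mask.

    mask[i][j] is True when position i may attend to position j.
    Assumption: the local context is a symmetric window of `window`
    frames on each side of the query position.
    """
    mask = [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]
    if use_global_token:
        # Assumption: position 0 plays the role of the global token.
        # It attends to every position, and every position attends to it.
        for k in range(seq_len):
            mask[0][k] = True
            mask[k][0] = True
    return mask


def count_attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    seq_len, window = 6, 1
    local_only = limited_context_mask(seq_len, window, use_global_token=False)
    with_global = limited_context_mask(seq_len, window, use_global_token=True)
    print("full attention pairs:", seq_len * seq_len)
    print("local-only pairs:", count_attended_pairs(local_only))
    print("local + global token pairs:", count_attended_pairs(with_global))
```

With a fixed window, the number of allowed pairs grows linearly with sequence length rather than quadratically, which is what makes very long inputs tractable; the global token adds only one extra row and column of allowed pairs.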