Conformer-based models have become the dominant end-to-end architecture for speech processing tasks. In this work, we propose a carefully redesigned Conformer with a new downsampling scheme. The proposed model, named Fast Conformer, is 2.8x faster than the original Conformer while preserving state-of-the-art accuracy on Automatic Speech Recognition benchmarks. We also replace the original Conformer's global attention with limited-context attention after training, which enables transcription of hour-long audio. We further improve long-form speech transcription by adding a global token. Fast Conformer combined with a Transformer decoder also outperforms the original Conformer in both accuracy and speed on Speech Translation and Spoken Language Understanding.
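To make the idea of limited-context attention with a global token concrete, here is a minimal sketch of the corresponding attention mask. It is an illustration under stated assumptions, not the implementation used in this work: the helper names `limited_context_mask` and `count_allowed`, the symmetric `window` parameter, and the choice to place the global token at position 0 are all hypothetical, since the abstract does not specify them.

```python
def limited_context_mask(seq_len, window, use_global_token=False):
    """Return mask[i][j] == True when query position i may attend to key position j.

    Hypothetical helper: each position sees neighbours within `window` frames on
    either side. If `use_global_token` is set, position 0 is treated as a global
    token (an assumption) that attends to, and is attended by, every position.
    """
    mask = [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]
    if use_global_token:
        for i in range(seq_len):
            mask[0][i] = True  # global token attends to every position
            mask[i][0] = True  # every position attends to the global token
    return mask


def count_allowed(mask):
    """Number of (query, key) pairs the mask permits."""
    return sum(sum(row) for row in mask)


local_mask = limited_context_mask(6, 1)
global_mask = limited_context_mask(6, 1, use_global_token=True)
print(count_allowed(local_mask), count_allowed(global_mask), 6 * 6)
```

For a fixed window, the number of permitted pairs grows roughly linearly with sequence length instead of quadratically, which is what makes very long inputs tractable. The global token adds only one extra row and column, giving distant frames a shared path for exchanging information.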