SqueezeFormer has recently shown impressive performance in automatic speech recognition (ASR). However, its inference speed suffers from the quadratic complexity of softmax attention (SA) with respect to input length. In addition, its large convolution kernel size limits its ability to model local interactions. In this paper, we propose a novel method, HybridFormer, that improves both the speed and the accuracy of SqueezeFormer. First, we incorporate linear attention (LA) and propose a hybrid LASA paradigm that combines LA with SA to increase the model's inference speed. Second, we propose NSR, a hybrid mechanism in which neural architecture search (NAS) guides structural re-parameterization (SRep), to strengthen the model's ability to extract local interactions. Extensive experiments on the LibriSpeech dataset show that HybridFormer achieves a 9.1% relative word error rate (WER) reduction over SqueezeFormer on the test-other set. Furthermore, for 30 s input speech, HybridFormer improves inference speed by up to 18%. Our source code is available online.
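To make the complexity contrast between SA and LA concrete, the following is a minimal pure-Python sketch, not the HybridFormer implementation. The abstract does not say which LA variant is used; the sketch assumes a kernelized form with the feature map elu(x) + 1, and all function names are hypothetical. SA builds a T x T score matrix, whereas LA first summarizes keys and values into a d x d_v matrix, so its cost grows linearly with the sequence length T.

```python
import math


def elu_plus_one(x):
    # Assumed positive feature map phi(x) = elu(x) + 1 (an illustrative choice).
    return x + 1.0 if x > 0 else math.exp(x)


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_attention(q, k, v):
    """Standard SA: materializes a T x T score matrix, so O(T^2) in length T."""
    d = len(q[0])
    scores = matmul(q, transpose(k))  # T x T
    out = []
    for row in scores:
        scaled = [s / math.sqrt(d) for s in row]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def linear_attention(q, k, v):
    """Kernelized LA: phi(Q) (phi(K)^T V), normalized per query; O(T) in T."""
    fq = [[elu_plus_one(x) for x in row] for row in q]
    fk = [[elu_plus_one(x) for x in row] for row in k]
    kv = matmul(transpose(fk), v)            # d x d_v summary, independent of T
    k_sum = [sum(col) for col in zip(*fk)]   # length-d normalizer term
    numerators = matmul(fq, kv)              # T x d_v
    out = []
    for fq_row, num_row in zip(fq, numerators):
        z = sum(a * b for a, b in zip(fq_row, k_sum))
        out.append([n / z for n in num_row])
    return out
```

The two functions do not return the same values, since LA replaces the softmax weights with a kernel approximation. How LA and SA layers are mixed in the LASA paradigm is not specified in the abstract and is not modeled here.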
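The idea behind structural re-parameterization can be sketched in one dimension: parallel convolution branches with different kernel sizes, used during training, can be folded into a single kernel for inference because convolution is linear. The sketch below is illustrative only. The branch layout, the kernel sizes, and the function names are assumptions, and the NAS step that selects branches in NSR is not shown.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero 'same' padding (odd kernel length)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(x))]


def merge_kernels(large, small):
    """Center the small kernel inside the large one and add the weights."""
    offset = (len(large) - len(small)) // 2
    merged = list(large)
    for j, w in enumerate(small):
        merged[offset + j] += w
    return merged


# Hypothetical example: a size-5 branch plus a size-3 branch.
x = [0.5, -1.0, 2.0, 3.0, -0.5, 1.5, 0.0]
large = [0.1, -0.2, 0.3, 0.4, -0.1]
small = [1.0, 0.5, -1.0]

# Training-time view: two branches computed separately, then summed.
two_branch = [a + b for a, b in zip(conv1d_same(x, large), conv1d_same(x, small))]

# Inference-time view: one convolution with the merged kernel.
one_branch = conv1d_same(x, merge_kernels(large, small))

max_diff = max(abs(a - b) for a, b in zip(two_branch, one_branch))
print(max_diff)  # expected to be at floating-point rounding level
```

Because the merged kernel reproduces the summed branch outputs, the extra small-kernel branch can improve local modeling during training without adding inference cost.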