In this paper, we show that a simple self-supervised pre-trained audio model can match the inference efficiency of more complicated pre-trained models that use speech transformer encoders. These speech transformers mix convolutional modules with self-attention modules, and they achieve state-of-the-art performance on automatic speech recognition (ASR) with top efficiency. We first show that employing these speech transformers as an encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that comparable efficiency can be achieved with advanced self-attention alone. We demonstrate that this simpler approach is particularly beneficial when combined with low-bit weight quantization, a technique for improving the efficiency of a neural network. We hypothesize that, unlike recent speech transformers that mix quantized convolution modules with quantized self-attention modules, the attention-only design avoids propagating errors between different quantized modules.
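To make the notion of low-bit weight quantization concrete, the following is a minimal sketch of a generic symmetric uniform quantizer. It is not the quantization scheme used in this paper, which the abstract does not specify; the function names (`quantize_weights`, `dequantize`, `max_abs_error`) and the example weight values are hypothetical and chosen only for illustration. The sketch shows how the rounding error on a weight vector grows as the bit width shrinks, which is the kind of error the abstract hypothesizes can propagate between quantized modules.

```python
# Illustrative sketch only: a generic symmetric uniform weight quantizer.
# This is an assumed, textbook-style scheme, not the method from the paper.

def quantize_weights(weights, bits):
    """Map float weights to signed integer codes on a symmetric grid.

    Codes lie in [-(2**(bits-1) - 1), 2**(bits-1) - 1]; `scale` is the step size.
    """
    if bits < 2:
        raise ValueError("bits must be at least 2 for a signed symmetric grid")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    # Avoid division by zero when all weights are zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest absolute difference between original and dequantized weights."""
    codes, scale = quantize_weights(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical example weights (not taken from any model).
example_weights = [0.42, -0.17, 0.03, -0.88, 0.65, 0.29]

if __name__ == "__main__":
    for bits in (8, 4, 2):
        print(f"{bits}-bit max abs error: {max_abs_error(example_weights, bits):.4f}")
```

With round-to-nearest, the per-weight error is bounded by half the step size, so halving the bit width roughly squares the number of lost levels and widens that bound accordingly. This sketch covers a single weight vector only; it does not model how errors interact across stacked convolution and self-attention modules.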