In this paper, we show that a simple self-supervised pre-trained audio model can achieve inference efficiency comparable to that of more complicated pre-trained models with speech transformer encoders. These speech transformers mix convolutional modules with self-attention modules and achieve state-of-the-art ASR performance with top efficiency. We first show that employing these speech transformers as an encoder significantly improves the efficiency of pre-trained audio models as well. However, our study shows that comparable efficiency can be achieved with advanced self-attention alone. We demonstrate that this simpler approach is particularly beneficial when combined with low-bit weight quantization of the neural network to improve efficiency. We hypothesize that, compared to recent speech transformers that mix quantized convolution and quantized self-attention modules, it prevents errors from propagating between different quantized modules.
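To make the notion of low-bit weight quantization concrete, the following is a minimal sketch of symmetric, per-tensor uniform quantization at 4 bits. It is only an illustration under stated assumptions: the abstract does not specify the quantization scheme, and the function names (`quantize_weights`, `dequantize`), the bit width, and the example weights are all hypothetical. The point it shows is that each quantized module carries a rounding error of at most half a quantization step per weight, which is the kind of error that could compound when several differently quantized modules are stacked.

```python
def quantize_weights(weights, bits=4):
    """Symmetric per-tensor uniform quantization (illustrative assumption,
    not necessarily the scheme used in the paper)."""
    qmax = 2 ** (bits - 1) - 1                # e.g. 7 for 4-bit signed codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer code and clamp to [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.35, 0.07, -1.40, 0.51, 0.00]
codes, scale = quantize_weights(weights, bits=4)
restored = dequantize(codes, scale)

# The per-weight rounding error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Lowering `bits` widens the quantization step `scale`, so the error bound `scale / 2` grows accordingly.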