Recent end-to-end deep learning models have been shown to match or exceed the performance of state-of-the-art Recurrent Neural Networks (RNNs) on Automatic Speech Recognition tasks. These models tend to be lighter weight and require less training time than traditional RNN-based approaches. However, they take a frequentist approach to weight training, learning a single point estimate for each weight. In theory, network weights are drawn from a latent, intractable probability distribution. We introduce BayesSpeech for end-to-end Automatic Speech Recognition. BayesSpeech is a recurrence-free Bayesian Transformer network in which these intractable posteriors are learned through variational inference and the local reparameterization trick. We show how introducing variance in the weights leads to shorter training time and near state-of-the-art performance on LibriSpeech-960.
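As an illustration of the local reparameterization trick named above, the following is a minimal NumPy sketch, not the BayesSpeech implementation. It assumes a fully factorized Gaussian variational posterior over the weights of a single linear layer and a standard normal prior; the abstract specifies neither. The function names `bayes_linear_lrt` and `kl_to_standard_normal` are hypothetical. Instead of sampling a weight matrix, the layer samples its pre-activations directly from the Gaussian they follow.

```python
import numpy as np

def bayes_linear_lrt(x, w_mu, w_log_var, rng):
    """Sample pre-activations of a linear layer with Gaussian weights.

    Assumes each weight is an independent Gaussian N(w_mu, exp(w_log_var)),
    so each pre-activation is also Gaussian and can be sampled directly.
    """
    act_mu = x @ w_mu                        # mean of the pre-activations
    act_var = (x ** 2) @ np.exp(w_log_var)   # variance of the pre-activations
    eps = rng.standard_normal(act_mu.shape)  # one noise draw per activation
    return act_mu + np.sqrt(act_var) * eps

def kl_to_standard_normal(w_mu, w_log_var):
    """KL(q || p) for factorized Gaussian q and an assumed N(0, 1) prior p."""
    return 0.5 * np.sum(np.exp(w_log_var) + w_mu ** 2 - 1.0 - w_log_var)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))           # batch of 4 inputs with 8 features
w_mu = 0.1 * rng.standard_normal((8, 3))  # variational means
w_log_var = np.full((8, 3), -6.0)         # variational log-variances
out = bayes_linear_lrt(x, w_mu, w_log_var, rng)
kl = kl_to_standard_normal(w_mu, w_log_var)
print(out.shape, float(kl))
```

In a variational training loop, a KL term of this kind would be added to the data-fit loss to form the negative evidence lower bound; how the two terms are weighted is a further design choice not stated in the abstract.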