Bayesian Learning for Deep Neural Network Adaptation

A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences. Speaker adaptation techniques play a vital role in reducing this mismatch. Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness. When the amount of speaker-level data is limited, speaker adaptation is prone to overfitting and poor generalization. To address this issue, this paper proposes a full Bayesian learning based DNN speaker adaptation framework that models speaker-dependent (SD) parameter uncertainty given limited speaker-specific adaptation data. This framework is investigated in three forms of model-based DNN adaptation techniques: Bayesian learning of hidden unit contributions (BLHUC), Bayesian parameterized activation functions (BPAct), and Bayesian hidden unit bias vectors (BHUB). In all three methods, deterministic SD parameters are replaced by latent variable posterior distributions for each speaker, whose parameters are efficiently estimated using a variational inference based approach. Experiments conducted on LF-MMI TDNN/CNN-TDNN systems trained on the 300-hour speed-perturbed Switchboard corpus suggest that the proposed Bayesian adaptation approaches consistently outperform deterministic adaptation on the NIST Hub5'00 and RT03 evaluation sets. When using only the first five utterances from each speaker as adaptation data, significant word error rate reductions of up to 1.4% absolute (7.2% relative) were obtained on the CallHome subset. The efficacy of the proposed Bayesian adaptation techniques is further demonstrated in a comparison against the state-of-the-art performance obtained on the same task using the most recent systems reported in the literature.
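The idea of replacing a deterministic SD parameter vector with a per-speaker posterior can be illustrated with a minimal NumPy sketch of the BLHUC case. The abstract gives no equations, so everything here is an assumption made for illustration: a diagonal Gaussian posterior, a standard normal prior, the 2·sigmoid scaling commonly used in LHUC, and the hypothetical function names `blhuc_forward` and `kl_to_prior`. It is not the authors' implementation.

```python
import numpy as np


def blhuc_forward(hidden, mu, log_sigma, rng, n_samples=1):
    """Scale hidden activations by sampled speaker-dependent LHUC vectors.

    hidden:    (frames, units) activations of one hidden layer
    mu:        (units,) posterior mean of the SD scaling parameters
    log_sigma: (units,) log posterior standard deviation
    """
    sigma = np.exp(log_sigma)
    # Reparameterised draws r = mu + sigma * eps, with eps ~ N(0, I)
    eps = rng.standard_normal((n_samples,) + mu.shape)
    r = mu + sigma * eps
    # Assumed LHUC activation: 2 * sigmoid(r), so each scale lies in (0, 2)
    scale = 2.0 / (1.0 + np.exp(-r))
    # Monte Carlo average of the scaled activations over the samples
    return (hidden[None, :, :] * scale[:, None, :]).mean(axis=0)


def kl_to_prior(mu, log_sigma, prior_mu=0.0, prior_sigma=1.0):
    """KL( N(mu, sigma^2) || N(prior_mu, prior_sigma^2) ), summed over units."""
    sigma = np.exp(log_sigma)
    kl = (np.log(prior_sigma / sigma)
          + (sigma ** 2 + (mu - prior_mu) ** 2) / (2.0 * prior_sigma ** 2)
          - 0.5)
    return float(np.sum(kl))


rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 3))   # 4 frames, 3 hidden units (toy sizes)
mu = np.zeros(3)
log_sigma = np.full(3, -1.0)

adapted = blhuc_forward(hidden, mu, log_sigma, rng, n_samples=8)
penalty = kl_to_prior(mu, log_sigma)
```

In a variational setup of this kind, the per-speaker adaptation loss would be the recognition loss computed on `adapted` plus `penalty`. The KL term pulls the posterior toward the prior when only a few utterances are available, which is the mechanism by which overfitting is expected to be reduced.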
